# Recent Publications

## (2016 to Present)

### John Bachan, Scott B. Baden, Steven Hofmeyr, Mathias Jacquelin, Amir Kamil, Dan Bonachea, Paul H. Hargrove, Hadia Ahmed,"UPC++: A High-Performance Communication Framework for Asynchronous Computation",33rd IEEE International Parallel & Distributed Processing Symposium (IPDPS'19),Rio de Janeiro, Brazil,IEEE,May 2019,doi: 10.25344/S4V88H

UPC++ is a C++ library that supports high-performance computation via an asynchronous communication framework. This paper describes a new incarnation that differs substantially from its predecessor, and we discuss the reasons for our design decisions. We present new design features, including future-based asynchrony management, distributed objects, and generalized Remote Procedure Call (RPC).
We show microbenchmark performance results demonstrating that one-sided Remote Memory Access (RMA) in UPC++ is competitive with MPI-3 RMA; on a Cray XC40 UPC++ delivers up to a 25% improvement in the latency of blocking RMA put, and up to a 33% bandwidth improvement in an RMA throughput test. We showcase the benefits of UPC++ with irregular applications through a pair of application motifs, a distributed hash table and a sparse solver component. Our distributed hash table in UPC++ delivers near-linear weak scaling up to 34816 cores of a Cray XC40. Our UPC++ implementation of the sparse solver component shows robust strong scaling up to 2048 cores, where it outperforms variants communicating using MPI by up to 3.1x.
UPC++ encourages the use of aggressive asynchrony in low-overhead RMA and RPC, improving programmer productivity and delivering high performance in irregular applications.

### Daniel F. Martin, Stephen L. Cornford, Antony J. Payne,"Millennial‐scale Vulnerability of the Antarctic Ice Sheet to Regional Ice Shelf Collapse",Geophysical Research Letters,January 9, 2019,doi: 10.1029/2018gl081229

Abstract:

The Antarctic Ice Sheet (AIS) remains the largest uncertainty in projections of future sea level rise. A likely climate‐driven vulnerability of the AIS is thinning of floating ice shelves resulting from surface‐melt‐driven hydrofracture or incursion of relatively warm water into subshelf ocean cavities. The resulting melting, weakening, and potential ice‐shelf collapse reduces shelf buttressing effects. Upstream ice flow accelerates, causing thinning, grounding‐line retreat, and potential ice sheet collapse. While high‐resolution projections have been performed for localized Antarctic regions, full‐continent simulations have typically been limited to low‐resolution models. Here we quantify the vulnerability of the entire present‐day AIS to regional ice‐shelf collapse on millennial timescales treating relevant ice flow dynamics at the necessary ∼1km resolution. Collapse of any of the ice shelves dynamically connected to the West Antarctic Ice Sheet (WAIS) is sufficient to trigger ice sheet collapse in marine‐grounded portions of the WAIS. Vulnerability elsewhere appears limited to localized responses.

Plain Language Summary:

The biggest uncertainty in near‐future sea level rise (SLR) comes from the Antarctic Ice Sheet. Antarctic ice flows in relatively fast‐moving ice streams. At the ocean, ice flows into enormous floating ice shelves which push back on their feeder ice streams, buttressing them and slowing their flow. Melting and loss of ice shelves due to climate changes can result in faster‐flowing, thinning and retreating ice leading to accelerated rates of global sea level rise. To learn where Antarctica is vulnerable to ice‐shelf loss, we divided it into 14 sectors, applied extreme melting to each sector's floating ice shelves in turn, then ran our ice flow model 1000 years into the future for each case. We found three levels of vulnerability. The greatest vulnerability came from attacking any of the three ice shelves connected to West Antarctica, where much of the ice sits on bedrock lying below sea level. Those dramatic responses contributed around 2m of sea level rise. The second level came from four other sectors, each with a contribution between 0.5‐1m. The remaining sectors produced little to no contribution. We examined combinations of sectors, determining that sectors behave independently of each other for at least a century.

### Paul H. Hargrove, Dan Bonachea,"GASNet-EX Performance Improvements Due to Specialization for the Cray Aries Network",Parallel Applications Workshop, Alternatives To MPI (PAW-ATM),Dallas, Texas, USA,November 16, 2018,doi: 10.25344/S44S38

GASNet-EX is a portable, open-source, high-performance communication library designed to efficiently support the networking requirements of PGAS runtime systems and other alternative models on future exascale machines. This paper reports on the improvements in performance observed on Cray XC-series systems due to enhancements made to the GASNet-EX software. These enhancements, known as "specializations", primarily consist of replacing network-independent implementations of several recently added features with implementations tailored to the Cray Aries network. Performance gains from specialization include (1) Negotiated-Payload Active Messages improve bandwidth of a ping-pong test by up to 14%, (2) Immediate Operations reduce running time of a synthetic benchmark by up to 93%, (3) non-bulk RMA Put bandwidth is increased by up to 32%, (4) Remote Atomic performance is 70% faster than the reference on a point-to-point test and allows a hot-spot test to scale robustly, and (5) non-contiguous RMA interfaces see up to 8.6x speedups for an intra-node benchmark and 26% for inter-node. These improvements are all available in GASNet-EX version 2018.3.0 and later.

### Scott B. Baden, Paul H. Hargrove, Hadia Ahmed, John Bachan, Dan Bonachea, Steve Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen,"UPC++ and GASNet-EX: PGAS Support for Exascale Applications and Runtimes",The International Conference for High Performance Computing, Networking, Storage and Analysis (SC'18),November 13, 2018

Lawrence Berkeley National Lab is developing a programming system to support HPC application development using the Partitioned Global Address Space (PGAS) model. This work is driven by the emerging need for adaptive, lightweight communication in irregular applications at exascale. We present an overview of UPC++ and GASNet-EX, including examples and performance results.

GASNet-EX is a portable, high-performance communication library, leveraging hardware support to efficiently implement Active Messages and Remote Memory Access (RMA). UPC++ provides higher-level abstractions appropriate for PGAS programming such as: one-sided communication (RMA), remote procedure call, locality-aware APIs for user-defined distributed objects, and robust support for asynchronous execution to hide latency. Both libraries have been redesigned relative to their predecessors to meet the needs of exascale computing. While both libraries continue to evolve, the system already demonstrates improvements in microbenchmarks and application proxies.

### Dan Bonachea, Paul H. Hargrove,"GASNet-EX: A High-Performance, Portable Communication Library for Exascale",Languages and Compilers for Parallel Computing (LCPC'18),Salt Lake City, Utah, USA,October 11, 2018,LBNL 2001174, doi: 10.25344/S4QP4W

Partitioned Global Address Space (PGAS) models, typified by such languages as Unified Parallel C (UPC) and Co-Array Fortran, expose one-sided communication as a key building block for High Performance Computing (HPC) applications. Architectural trends in supercomputing make such programming models increasingly attractive, and newer, more sophisticated models such as UPC++, Legion and Chapel that rely upon similar communication paradigms are gaining popularity.

GASNet-EX is a portable, open-source, high-performance communication library designed to efficiently support the networking requirements of PGAS runtime systems and other alternative models in future exascale machines. The library is an evolution of the popular GASNet communication system, building upon over 15 years of lessons learned. We describe and evaluate several features and enhancements that have been introduced to address the needs of modern client systems. Microbenchmark results demonstrate the RMA performance of GASNet-EX is competitive with several MPI-3 implementations on current HPC systems.

### John Bachan, Scott Baden, Dan Bonachea, Paul Hargrove, Steven Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen,"UPC++ Programmer's Guide, v1.0-2018.9.0",Lawrence Berkeley National Laboratory Tech Report,September 26, 2018,LBNL 2001180, doi: 10.25344/S49G6V

UPC++ is a C++11 library that provides Partitioned Global Address Space (PGAS) programming. It is designed for writing parallel programs that run efficiently and scale well on distributed-memory parallel computers. The PGAS model is single program, multiple-data (SPMD), with each separate constituent process having access to local memory as it would in C++. However, PGAS also provides access to a global address space, which is allocated in shared segments that are distributed over the processes. UPC++ provides numerous methods for accessing and using global memory. In UPC++, all operations that access remote memory are explicit, which encourages programmers to be aware of the cost of communication and data movement. Moreover, all remote-memory access operations are by default asynchronous, to enable programmers to write code that scales well even on hundreds of thousands of cores.

### John Bachan, Scott Baden, Dan Bonachea, Paul Hargrove, Steven Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen,"UPC++ Specification v1.0, Draft 8",Lawrence Berkeley National Laboratory Tech Report,September 26, 2018,LBNL 2001179, doi: 10.25344/S45P4X

UPC++ is a C++11 library providing classes and functions that support Partitioned Global Address Space (PGAS) programming. We are revising the library under the auspices of the DOE’s Exascale Computing Project, to meet the needs of applications requiring PGAS support. UPC++ is intended for implementing elaborate distributed data structures where communication is irregular or fine-grained. The UPC++ interfaces for moving non-contiguous data and handling memories with different optimal access methods are composable and similar to those used in conventional C++. The UPC++ programmer can expect communication to run at close to hardware speeds. The key facilities in UPC++ are global pointers, which enable the programmer to express ownership information for improving locality; one-sided communication, both put/get and RPC; and futures and continuations. Futures capture data readiness state, which is useful in making scheduling decisions, and continuations provide for completion handling via callbacks. Together, these enable the programmer to chain together a DAG of operations to execute asynchronously as high-latency dependencies become satisfied.
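
The chaining of operations described in this abstract can be illustrated by analogy. The sketch below is not UPC++ code and does not use the UPC++ API; it uses Python's standard `asyncio` module to show two independent high-latency operations feeding two dependent steps, forming a small DAG. The names `remote_get`, `product`, and `plus_one` are hypothetical stand-ins introduced only for this illustration.

```python
import asyncio


async def remote_get(value, delay):
    """Hypothetical stand-in for a high-latency remote read."""
    await asyncio.sleep(delay)
    return value


async def main():
    # Two independent operations are started immediately and overlap in time.
    a = asyncio.ensure_future(remote_get(2, 0.02))
    b = asyncio.ensure_future(remote_get(3, 0.01))

    async def product():
        # Dependent step: resumes only once both inputs are ready.
        x, y = await asyncio.gather(a, b)
        return x * y

    async def plus_one():
        # Second dependent step that reuses the result of `a`.
        return (await a) + 1

    # Both dependent steps are scheduled together and complete as their
    # dependencies become satisfied.
    return await asyncio.gather(product(), plus_one())


results = asyncio.run(main())
```

Here `a` is consumed by two downstream steps, which is what distinguishes a DAG from a simple linear chain of callbacks.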

### Dan Bonachea, Paul Hargrove,"GASNet-EX Performance Improvements Due to Specialization for the Cray Aries Network",Lawrence Berkeley National Laboratory Tech Report,March 27, 2018,LBNL 2001134, doi: 10.2172/1430690

This document is a deliverable for milestone STPM17-6 of the Exascale Computing Project, delivered by WBS 2.3.1.14. It reports on the improvements in performance observed on Cray XC-series systems due to enhancements made to the GASNet-EX software. These enhancements, known as “specializations”, primarily consist of replacing network-independent implementations of several recently added features with implementations tailored to the Cray Aries network. Performance gains from specialization include (1) Negotiated-Payload Active Messages improve bandwidth of a ping-pong test by up to 14%, (2) Immediate Operations reduce running time of a synthetic benchmark by up to 93%, (3) non-bulk RMA Put bandwidth is increased by up to 32%, (4) Remote Atomic performance is 70% faster than the reference on a point-to-point test and allows a hot-spot test to scale robustly, and (5) non-contiguous RMA interfaces see up to 8.6x speedups for an intra-node benchmark and 26% for inter-node. These improvements are available in the GASNet-EX 2018.3.0 release.

### N. Sanderson, E. Shugerman, S. Molnar, J. Meiss, E. Bradley,"Computational Topology Techniques for Characterizing Time-Series Data",Advances in Intelligent Data Analysis XVI 16th International Symposium, IDA 2017, London, UK, October 26–28, 2017, Proceedings,October 2017,pp.284-296,doi: 10.1007/978-3-319-68765-0_24

Topological data analysis (TDA), while abstract, allows a characterization of time-series data obtained from nonlinear and complex dynamical systems. Though it is surprising that such an abstract measure of structure—counting pieces and holes—could be useful for real-world data, TDA lets us compare different systems, and even do membership testing or change-point detection. However, TDA is computationally expensive and involves a number of free parameters. This complexity can be obviated by coarse-graining, using a construct called the witness complex. The parametric dependence gives rise to the concept of persistent homology: how shape changes with scale. Its results allow us to distinguish time-series data from different systems—e.g., the same note played on different musical instruments.
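
As a rough illustration of "counting pieces" as scale changes, the following sketch tracks only the number of connected components of a small point cloud as the connection radius grows. It is a simplified stand-in and not the witness-complex construction used in the paper; the function name `component_counts` and the sample points are hypothetical.

```python
from itertools import combinations
from math import dist


def component_counts(points, scales):
    """Count connected components when points within each scale are linked."""
    # Sort all pairwise edges by length so they can be added incrementally.
    edges = sorted((dist(p, q), i, j)
                   for (i, p), (j, q) in combinations(enumerate(points), 2))
    parent = list(range(len(points)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    counts, n_components, k = [], len(points), 0
    for eps in sorted(scales):
        # Add every edge no longer than the current scale.
        while k < len(edges) and edges[k][0] <= eps:
            _, i, j = edges[k]
            ri, rj = find(i), find(j)
            if ri != rj:
                parent[ri] = rj
                n_components -= 1
            k += 1
        counts.append((eps, n_components))
    return counts


# Two tight clusters far apart: every point is its own piece at a tiny
# scale, the clusters merge internally at a moderate scale, and everything
# joins once the scale spans the gap between clusters.
cloud = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0)]
persistence = component_counts(cloud, [0.05, 0.2, 10.0])
```

The count that survives over a wide range of scales (two components here) is the kind of persistent feature that the abstract refers to.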

### John Bachan, Scott Baden, Dan Bonachea, Paul Hargrove, Steven Hofmeyr, Khaled Ibrahim, Mathias Jacquelin, Amir Kamil, Brian van Straalen,"UPC++ Programmer’s Guide, v1.0-2017.9",Lawrence Berkeley National Laboratory Tech Report,September 29, 2017,LBNL 2001065, doi: 10.2172/1398522

This document has been superseded by: UPC++ Programmer’s Guide, v1.0-2018.3.0 (LBNL-2001136)

UPC++ is a C++11 library that provides Asynchronous Partitioned Global Address Space (APGAS) programming. It is designed for writing parallel programs that run efficiently and scale well on distributed-memory parallel computers. The APGAS model is single program, multiple-data (SPMD), with each separate thread of execution (referred to as a rank, a term borrowed from MPI) having access to local memory as it would in C++. However, APGAS also provides access to a global address space, which is allocated in shared segments that are distributed over the ranks. UPC++ provides numerous methods for accessing and using global memory. In UPC++, all operations that access remote memory are explicit, which encourages programmers to be aware of the cost of communication and data movement. Moreover, all remote-memory access operations are by default asynchronous, to enable programmers to write code that scales well even on hundreds of thousands of cores.

### Dan Bonachea, Paul Hargrove,"GASNet Specification, v1.8.1",Lawrence Berkeley National Laboratory Tech Report,August 31, 2017,LBNL 2001064, doi: 10.2172/1398512

GASNet is a language-independent, low-level networking layer that provides network-independent, high-performance communication primitives tailored for implementing parallel global address space SPMD languages and libraries such as UPC, UPC++, Co-Array Fortran, Legion, Chapel, and many others. The interface is primarily intended as a compilation target and for use by runtime library writers (as opposed to end users), and the primary goals are high performance, interface portability, and expressiveness. GASNet stands for "Global-Address Space Networking".

### Dongeun Lee, Alex Sim, Jaesik Choi, Kesheng Wu,"Improving Statistical Similarity Based Data Reduction for Non-Stationary Data",29th International Conference on Scientific and Statistical Database Management (SSDBM2017),2017,doi: 10.1145/3085504.3085583

Updated experiment version: https://sdm.lbl.gov/oapapers/ssdbm17-lee-upd.pdf
Original version: http://dl.acm.org/citation.cfm?doid=3085504.3085583

### Anastasiia Butko, Florent Bruguier, Abdoulaye Gamatié, Gilles Sassatelli,"Efficient Programming for Multicore Processor Heterogeneity: OpenMP versus OmpSs",Open Source Supercomputing Workshop,Frankfurt, Germany,Springer’s Lecture Notes in Computer Science (LNCS),June 22, 2017

ARM single-ISA heterogeneous multicore processors combine high-performance big cores with power-efficient small cores. They aim at achieving a suitable balance between performance and energy. However, a main challenge is to program such architectures so as to efficiently exploit their features. In this paper, we study the impact on performance and energy trade-offs of single-ISA architecture according to OpenMP 3.0 and the OmpSs programming models. We consider different symmetric/asymmetric architecture configurations in terms of core frequency and core count between big and LITTLE clusters. Experiments are conducted on both a real Samsung Exynos 5 Octa system-on-chip and the gem5/McPAT simulation frameworks. Results show that OmpSs implementations are more sensitive to loop scheduling parameters than OpenMP 3.0. In most cases, best OmpSs configurations significantly outperform OpenMP ones. While cluster frequency asymmetry provides uninteresting results, asymmetric cluster configuration with single high-performance core and multiple low-power cores provides better performance/energy trade-offs in many cases.

### Dilip Vasudevan, Anastasiia Butko, George Michelogiannakis, David Donofrio, John Shalf,"Towards an Integrated Strategy to Preserve Digital Computing Performance Scaling Using Emerging Technologies",Workshop on HPC computing in a Post Moore’s law world (HCPM),June 22, 2017

With the decline and eventual end of historical rates of lithographic scaling, we arrive at a crossroad where synergistic and holistic decisions are required to preserve Moore's law technology scaling. Numerous emerging technologies aim to extend digital electronics scaling of performance, energy efficiency, and computational power/density,
ranging from devices (transistors), memories, 3D integration capabilities, specialized architectures, photonics, and others.
The wide range of technology options creates the need for an integrated strategy to understand the impact of these emerging technologies on future large-scale digital systems for diverse application requirements and optimization metrics.
In this paper, we argue for a comprehensive methodology that spans the different levels of abstraction -- from materials, to devices, to complex digital systems and applications. Our approach integrates compact models of low-level characteristics of the emerging technologies to inform higher-level simulation models to evaluate their responsiveness to application requirements.
The integrated framework can then automate the search for an optimal architecture using available emerging technologies to maximize a targeted optimization metric.

### SM Martin, MJ Berger, SB Baden,"Toucan-A Translator for Communication Tolerant MPI Applications",Proceedings - 2017 IEEE 31st International Parallel and Distributed Processing Symposium, IPDPS 2017,June 2017,998-1007,doi: 10.1109/IPDPS.2017.44

We discuss early results with Toucan, a source-to-source translator that automatically restructures C/C++ MPI applications to overlap communication with computation. We co-designed the translator and runtime system to enable dynamic, dependence-driven execution of MPI applications, and require only a modest amount of programmer annotation. Co-design was essential to realizing overlap through dynamic code block reordering and avoiding the limitations of static code relocation and inlining. We demonstrate that Toucan hides significant communication in four representative applications running on up to 24K cores of NERSC's Edison platform. Using Toucan, we have hidden from 33% to 85% of the communication overhead, with performance meeting or exceeding that of painstakingly hand-written overlap variants.

### Dai Wang, Junyu Gao, Pan Li, Bin Wang, Cong Zhang, Samveg Saxena,"Modeling of plug-in electric vehicle travel patterns and charging load based on trip chain generation",Journal of Power Sources,May 13, 2017,359:468 - 479,doi: 10.1016/j.jpowsour.2017.05.036

Modeling PEV travel and charging behavior is key to estimating the charging demand and further exploring the potential of providing grid services. This paper presents a stochastic simulation methodology to generate itineraries and charging load profiles for a population of PEVs based on real-world vehicle driving data. In order to describe the sequence of daily travel activities, we use the trip chain model, which contains the detailed information of each trip, namely start time, end time, trip distance, start location and end location. A trip chain generation method is developed based on the Naive Bayes model to generate a large number of trips which are temporally and spatially coupled. We apply the proposed methodology to investigate the multi-location charging loads in three different scenarios. Simulation results show that home charging can meet the energy demand of the majority of PEVs in an average condition. In addition, we calculate the lower bound of charging load peak on the premise of lowest charging cost. The results are instructive for the design and construction of charging facilities to avoid excessive infrastructure.

### Yingqi Xiong, Bin Wang, Chi-cheng Chu, Rajit Gadh,"Distributed Optimal Vehicle Grid Integration Strategy with User Behavior Prediction",IEEE PES General Meeting 2017,March 13, 2017

With increasing electric vehicle (EV) adoption in recent years, the impact of EV charging activities on the power grid is becoming more and more significant. In this article, an optimal scheduling algorithm which combines smart EV charging and V2G grid service is developed to integrate EVs into the power grid as distributed energy resources, with improved system cost performance. Specifically, an optimization problem is formulated and solved at each EV charging station according to the control signal from the aggregated control center and user charging behavior prediction by mean estimation and linear regression. The control center periodically collects distributed optimization results and updates the control signal. The iteration continues until it converges to the optimal schedule. Experimental results show that this algorithm helps fill the valley and shave the peak in electric load profiles within a microgrid, while the energy demand of individual drivers can be satisfied.

### Yubo Wang, Wenbo Shi, Bin Wang, Chi-Cheng Chu, Rajit Gadh,"Optimal operation of stationary and mobile batteries in distribution grids",Applied Energy,January 28, 2017,190:1289 - 130,doi: 10.1016/j.apenergy.2016.12.139

The growing integration of Battery Energy Storage Systems (BESS, stationary battery) and Electric Vehicles (EV, mobile battery) into distribution grids calls for an advanced Demand Side Management (DSM) technique that addresses the scalability concerns of the system and the stochastic availability of EVs. Towards this goal, a stochastic DSM is proposed to capture the uncertainties in EVs. Numerical approximation is then used to make the problem tractable. To accelerate computation, the proposed DSM is tightly relaxed to a convex form using second-order cone programming. Furthermore, in light of the continuously increasing problem size, a distributed method with guaranteed convergence is applied to shift the centralized computational burden to distributed controllers. To verify the proposed DSM, real-life EV data collected on the UCLA campus is used to test it in an IEEE benchmark test system. Numerical results demonstrate the correctness and merits of the proposed approach.

### John Bachan, Dan Bonachea, Paul H Hargrove, Steve Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen, Scott B Baden,"The UPC++ PGAS library for Exascale Computing",Proceedings of the Second Annual PGAS Applications Workshop,January 1, 2017,7

We describe UPC++ V1.0, a C++11 library that supports APGAS programming. UPC++ targets distributed data structures where communication is irregular or fine-grained. The key abstractions are global pointers, asynchronous programming via RPC, and futures. Global pointers incorporate ownership information useful in optimizing for locality. Futures capture data readiness state, are useful for scheduling and also enable the programmer to chain operations to execute asynchronously as high-latency dependencies become satisfied, via continuations. The interfaces for moving non-contiguous data and handling memories with different optimal access methods are composable and closely resemble those used in modern C++. Communication in UPC++ runs at close to hardware speeds by utilizing the low-overhead GASNet-EX communication library.

### Yingqi Xiong, Bin Wang, Zhiyuan Cao, Chi-cheng Chu, Hemanshu Pota, Rajit Gadh,"Extension of IEC61850 with smart EV charging",Innovative Smart Grid Technologies - Asia (ISGT-Asia), 2016 IEEE,Melbourne, VIC, Australia,IEEE,December 26, 2016,294 - 299,doi: 10.1109/ISGT-Asia.2016.7796401

Uncoordinated Electric Vehicle (EV) charging behaviors will impact the power grid and degrade power quality. A smart charging system can optimally allocate the energy among EVs and minimize the impact on the grid. IEC 61850 is an international standard for distribution & substation automation and intelligence device communication in smart grid. This paper introduces the extension of IEC 61850 with smart EV scheduling algorithms, considering current multiplexing, power sharing strategies and user behaviors. We have developed an IEC 61850 abstract data model for the information exchanged among components of smart EV charging infrastructures, including the EV charging control center, intelligent Electric Vehicle Supply Equipment (EVSE) and the EV users with mobile applications. Real-world EV usage data from the UCLA campus is utilized in the simulation experiment, which is based on the predictive control paradigm. The data model is instantiated by a web service on the EV control center, which converts the raw data streams from EVSEs in various formats, such as JSON or raw strings, into a standardized IEC 61850 SCL file containing the critical meter data and charging session parameters. The system cost performance, interoperability and scalability of the smart EV charging infrastructure are greatly improved.

### E. Vecharynski, A. Knyazev,"Preconditioned steepest descent-like methods for symmetric indefinite systems",Linear Algebra and its Applications, Vol. 511, pp. 274–295,2016

We construct preconditioned steepest descent (PSD)-like methods for iterative solution of symmetric indefinite linear systems using symmetric and positive definite (SPD) preconditioners. Our construction is based on a locally optimal residual minimization over two-dimensional subspaces, mathematically equivalent in exact arithmetic to preconditioned MINRES (PMINRES) restarted after every two steps. A convergence bound is derived. If certain information on the spectrum of the preconditioned system is available, we present a simpler PSD-like algorithm that performs only one-dimensional residual minimization. Search direction randomization for accelerating this algorithm is discussed. Our primary goal is to bridge the theoretical gap between the optimal (PMINRES) and PSD-like methods for solving symmetric indefinite systems. We also demonstrate situations where the suggested PSD-like schemes can be preferable to the optimal PMINRES iteration.
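
The two-dimensional residual minimization described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a diagonal SPD preconditioner `t_diag`, measures the residual in the preconditioner-induced norm (as preconditioned MINRES does), and minimizes over the span of `T r` and `T A T r` at each step. The function name `psd_like_solve`, the helper names, and the test matrix are hypothetical.

```python
def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def psd_like_solve(A, b, t_diag, iterations=200, tol=1e-10):
    """Each step minimizes the T-norm of the residual over span{T r, T A T r}."""
    def apply_t(v):
        # Diagonal SPD preconditioner (assumption made for this sketch).
        return [t * x for t, x in zip(t_diag, v)]

    x = [0.0] * len(b)
    history = []
    for _ in range(iterations):
        r = [bi - ai for bi, ai in zip(b, matvec(A, x))]
        t_r = apply_t(r)
        history.append(dot(r, t_r) ** 0.5)  # residual measured in the T-norm
        if history[-1] < tol:
            break
        p1 = t_r
        p2 = apply_t(matvec(A, p1))
        q1, q2 = matvec(A, p1), matvec(A, p2)
        tq1, tq2 = apply_t(q1), apply_t(q2)
        # 2x2 normal equations for min ||r - c1*q1 - c2*q2|| in the T-norm.
        g11, g12, g22 = dot(q1, tq1), dot(q1, tq2), dot(q2, tq2)
        h1, h2 = dot(tq1, r), dot(tq2, r)
        det = g11 * g22 - g12 * g12
        if det <= 1e-14 * g11 * g22:
            # Directions are (numerically) parallel: minimize along p1 only.
            c1, c2 = h1 / g11, 0.0
        else:
            c1 = (h1 * g22 - h2 * g12) / det
            c2 = (h2 * g11 - h1 * g12) / det
        x = [xi + c1 * a + c2 * c for xi, a, c in zip(x, p1, p2)]
    return x, history


# Small symmetric indefinite system (one negative eigenvalue) with the
# preconditioner built from the absolute values of the diagonal.
A = [[2.0, 1.0, 0.0], [1.0, -3.0, 1.0], [0.0, 1.0, 4.0]]
b = [1.0, 2.0, 3.0]
t_diag = [1.0 / abs(A[i][i]) for i in range(3)]
solution, residual_history = psd_like_solve(A, b, t_diag)
```

Because the zero update is always a candidate, the T-norm of the residual cannot increase from one step to the next in exact arithmetic.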

### S.V. Venkatakrishnan, Jeffrey Donatelli, Dinesh Kumar, Abhinav Sarje, Sunil K. Sinha, Xiaoye S. Li, Alexander Hexemer,"A Multi-slice Simulation Algorithm for Grazing-Incidence Small-Angle X-ray Scattering",Journal of Applied Crystallography,December 2016,49-6, doi: 10.1107/S1600576716013273

Grazing-incidence small-angle X-ray scattering (GISAXS) is an important technique in the characterization of samples at the nanometre scale. A key aspect of GISAXS data analysis is the accurate simulation of samples to match the measurement. The distorted-wave Born approximation (DWBA) is a widely used model for the simulation of GISAXS patterns. For certain classes of sample such as nanostructures embedded in thin films, where the electric field intensity variation is significant relative to the size of the structures, a multi-slice DWBA theory is more accurate than the conventional DWBA method. However, simulating complex structures in the multi-slice setting is challenging and the algorithms typically used are designed on a case-by-case basis depending on the structure to be simulated. In this paper, an accurate algorithm for GISAXS simulations based on the multi-slice DWBA theory is presented. In particular, fundamental properties of the Fourier transform have been utilized to develop an algorithm that accurately computes the average refractive index profile as a function of depth and the Fourier transform of the portion of the sample within a given slice, which are key quantities required for the multi-slice DWBA simulation. The results from this method are compared with the traditionally used approximations, demonstrating that the proposed algorithm can produce more accurate results. Furthermore, this algorithm is general with respect to the sample structure, and does not require any sample-specific approximations to perform the simulations.

Best Paper Award

### Bin Wang, Yubo Wang, Hamidreza Nazaripouya, Charlie Qiu, Chi-Cheng Chu, Rajit Gadh,"Predictive Scheduling Framework for Electric Vehicles with Uncertainties of User Behaviors",IEEE Internet of Things Journal,October 13, 2016,4:52 - 63,doi: 10.1109/JIOT.2016.2617314

The randomness of user behaviors plays a significant role in electric vehicle (EV) scheduling problems, especially when the power supply for EV supply equipment (EVSE) is limited. Existing EV scheduling methods do not consider this limitation and assume charging session parameters, such as stay duration and energy demand values, are perfectly known, which is not realistic in practice. In this paper, based on real-world implementations of networked EVSEs on the University of California at Los Angeles campus, we developed a predictive scheduling framework, including a predictive control paradigm and a kernel-based session parameter estimator. Specifically, the scheduling service periodically computes cost-efficient solutions using session parameters predicted by the adaptive kernel-based estimator, which improves estimation accuracy. We also consider the power sharing strategy of existing EVSEs and formulate the virtual load constraint to handle future EV arrivals with unexpected energy demand. To validate the proposed framework, 20-fold cross validation is performed on a historical dataset of charging behaviors covering more than one year. The simulation results demonstrate that the average unit energy cost per kWh can be reduced by 29.42% with the proposed scheduling framework and by 66.71% by further integrating solar generation with the given capacity, after the initial infrastructure investment. The effectiveness of the kernel-based estimator, the virtual load constraint, and the event-based control scheme is also discussed in detail.
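
To give a sense of what a kernel-based session parameter estimate might look like, here is a minimal sketch using a plain Nadaraya-Watson estimator with a Gaussian kernel. This is a generic technique substituted for illustration; the paper's adaptive estimator is not specified in the abstract and may differ. The function name `kernel_estimate`, the single feature (arrival hour), the bandwidth, and the sample sessions are all hypothetical.

```python
from math import exp


def kernel_estimate(history, query, bandwidth=1.0):
    """Nadaraya-Watson estimate: Gaussian-weighted average of past targets."""
    weights = [exp(-0.5 * ((x - query) / bandwidth) ** 2) for x, _ in history]
    total = sum(weights)
    if total == 0.0:
        # Query is far from every past session; fall back to the plain mean.
        return sum(y for _, y in history) / len(history)
    return sum(w * y for w, (_, y) in zip(weights, history)) / total


# Hypothetical past sessions: (arrival hour, stay duration in hours).
sessions = [(8.0, 9.0), (8.5, 8.5), (9.0, 8.0), (13.0, 2.0), (18.0, 1.5)]
morning_stay = kernel_estimate(sessions, 8.5)
evening_stay = kernel_estimate(sessions, 18.0)
```

Sessions with similar arrival times dominate the weighted average, so a morning arrival is predicted to stay much longer than an evening one in this toy data.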

### Anastasiia Butko, Florent Bruguier, Abdoulaye Gamatié, Gilles Sassatelli, David Novo, Lionel Torres, Michel Robert,"Full-System Simulation of big.LITTLE Multicore Architecture for Performance and Energy Exploration",Embedded Multicore/Many-core Systems-on-Chip (MCSoC), 2016 IEEE 10th International Symposium on,Lyon, France,IEEE,September 21, 2016,doi: 10.1109/MCSoC.2016.20

Single-ISA heterogeneous multicore processors have gained increasing popularity with the introduction of recent technologies such as ARM big.LITTLE. These processors offer increased energy efficiency through combining low power in-order cores with high performance out-of-order cores. Efficiently exploiting this attractive feature requires careful management so as to meet the demands of targeted applications. In this paper, we explore the design of those architectures based on the ARM big.LITTLE technology by modeling performance and power in gem5 and McPAT frameworks. Our models are validated w.r.t. the Samsung Exynos 5 Octa (5422) chip. We show average errors of 20% in execution time, 13% for power consumption and 24% for energy-to-solution.

### R. Li, Y. Xi, E. Vecharynski, C. Yang, and Y. Saad,"A Thick-Restart Lanczos algorithm with polynomial filtering for Hermitian eigenvalue problems",SIAM Journal on Scientific Computing, Vol. 38, Issue 4, pp. A2512–A2534,2016,doi: 10.1137/15M1054493

Polynomial filtering can provide a highly effective means of computing all eigenvalues of a real symmetric (or complex Hermitian) matrix that are located in a given interval, anywhere in the spectrum. This paper describes a technique for tackling this problem by combining a Thick-Restart version of the Lanczos algorithm with deflation ('locking') and a new type of polynomial filters obtained from a least-squares technique. The resulting algorithm can be utilized in a 'spectrum-slicing' approach whereby a very large number of eigenvalues and associated eigenvectors of the matrix are computed by extracting eigenpairs located in different sub-intervals independently from one another.

### Rafael Garibotti, Anastasiia Butko, Luciano Ost, Abdoulaye Gamatié, Gilles Sassatelli, Chris Adeniyi-Jones,"Efficient Embedded Software Migration towards Clusterized Distributed-Memory Architectures",IEEE Transactions on Computers,August 1, 2016,doi: 10.1109/TC.2015.2485202

A large portion of existing multithreaded embedded software has been programmed for symmetric shared-memory platforms, where a monolithic memory block is shared by all cores. Such platforms accommodate popular parallel programming models such as POSIX threads and OpenMP. However, with the growing number of cores in modern manycore embedded architectures, they present a bottleneck related to their centralized memory accesses. This paper proposes a solution tailored for the efficient execution of applications defined with shared-memory programming models on on-chip distributed-memory multicore architectures. It shows how performance, area, and energy consumption are significantly improved thanks to the scalability of these architectures. This is illustrated in an open-source realistic design framework, including tools from ASIC to microkernel.

### Bin Wang, Rui Huang, Yubo Wang, Hamidreza Nazaripouya, Charlie Qiu, Chi-Cheng Chu, Rajit Gadh,"Predictive Scheduling for Electric Vehicles Considering Uncertainty of Load and User Behaviors",2016 IEEE/PES Transmission and Distribution Conference and Exposition (T&D),Dallas, TX, USA,IEEE,July 25, 2016,1 - 5,doi: 10.1109/TDC.2016.7520018

Uncoordinated Electric Vehicle (EV) charging can create unexpected load in the local distribution grid, which may degrade power quality and system reliability. The uncertainty of EV load, user behaviors, and other baseload in the distribution grid is one of the challenges that impede optimal control of EV charging. Previous research did not fully solve this problem due to the lack of real-world EV charging data and of a proper stochastic model to describe these behaviors. In this paper, we propose a new predictive EV scheduling algorithm (PESA) inspired by Model Predictive Control (MPC), which includes a dynamic load estimation module and a predictive optimization module. The user-related EV load and base load are dynamically estimated based on historical data. At each time interval, the predictive optimization program is solved for optimal schedules given the estimated parameters. Only the first element of the algorithm output is implemented, following the MPC paradigm. The current-multiplexing function in each Electric Vehicle Supply Equipment (EVSE) is considered, and a virtual load is modeled accordingly to handle the uncertainties of future EV energy demands. The system is validated with real-world EV charging data collected on the UCLA campus, and the experimental results indicate that the proposed model not only reduces load variation by up to 40% but also maintains a high level of robustness. Finally, the IEC 61850 standard is used to standardize the data models involved, which is significant for more reliable, large-scale implementation.
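
The receding-horizon step described in this abstract (plan over the remaining horizon, apply only the first element, then re-plan) can be sketched as follows. This is a minimal illustration and not the PESA algorithm from the paper: the greedy valley-filling planner, the function names `plan_charging` and `receding_horizon`, and the `forecast_fn` callback are all assumptions made for the example.

```python
def plan_charging(base_load, energy_needed, max_rate):
    """Hypothetical planner: fill the lowest-load slots first (valley filling)."""
    plan = [0.0] * len(base_load)
    remaining = energy_needed
    # Visit slots in order of increasing estimated base load.
    for t in sorted(range(len(base_load)), key=lambda i: base_load[i]):
        if remaining <= 0:
            break
        plan[t] = min(max_rate, remaining)
        remaining -= plan[t]
    return plan


def receding_horizon(forecast_fn, energy_needed, max_rate, deadline):
    """Re-plan at every step and apply only the first element of each plan."""
    applied = []
    remaining = energy_needed
    for now in range(deadline):
        # Estimated base load for slots now .. deadline-1 (assumed callback).
        window = forecast_fn(now)
        plan = plan_charging(window, remaining, max_rate)
        applied.append(plan[0])
        remaining -= plan[0]
    return applied


# Example with a fixed forecast; a real estimator would update it at each step.
base = [5.0, 1.0, 4.0, 2.0]
schedule = receding_horizon(lambda now: base[now:], 2.0, 1.0, len(base))
```

With a fixed forecast, re-planning reproduces the initial plan; the benefit of the receding horizon appears only when `forecast_fn` returns updated estimates as new data arrive.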

### H. Nazaripouya, B. Wang, Y. Wang, P. Chu, H. R. Pota, R. Gadh,"Univariate Time Series Prediction of Solar Power Using a Hybrid Wavelet-ARMA-NARX Prediction Method",2016 IEEE/PES Transmission and Distribution Conference and Exposition (T&D),Dallas, TX, USA,IEEE,July 25, 2016,1 - 5,doi: 10.1109/TDC.2016.7519959

This paper proposes a new hybrid method for super short-term solar power prediction. Solar output power usually has a complex, nonstationary, and nonlinear characteristic due to the intermittent and time-varying behavior of solar radiance. In addition, solar power dynamics are fast and inertia-less. An accurate super short-term prediction is required to compensate for the fluctuations and reduce the impact of solar power penetration on the power system. The objective is to predict one-step-ahead solar power generation based only on historical solar power time series data. The proposed method incorporates the discrete wavelet transform (DWT), Auto-Regressive Moving Average (ARMA) models, and Recurrent Neural Networks (RNN), where the RNN architecture is based on Nonlinear Auto-Regressive models with eXogenous inputs (NARX). The wavelet transform is utilized to decompose the solar power time series into a set of richer-behaved forming series for prediction. The ARMA model is employed as a linear predictor, while NARX is used as a nonlinear pattern recognition tool to estimate and compensate for the error of the wavelet-ARMA prediction. The proposed method is applied to data captured from UCLA solar PV panels, and the results are compared with some of the common and most recent solar power prediction methods. The results validate the effectiveness of the proposed approach and show a considerable improvement in prediction precision.

### Yubo Wang, Bin Wang, Tianyang Zhang, Hamidreza Nazaripouya, Chi-Cheng Chu, Rajit Gadh,"Optimal Energy Management for Microgrid with Stationary and Mobile Storages",2016 IEEE/PES Transmission and Distribution Conference and Exposition (T&D),Dallas, TX, USA,IEEE,July 25, 2016,1 - 5,doi: 10.1109/TDC.2016.7520004

This paper studies energy management in a Microgrid (MG) with solar generation, a Battery Energy Storage System (BESS), and gridable (V2G) Electric Vehicles (EVs). A two-stage stochastic optimization method is proposed to capture the intermittent solar generation and random EV user behaviors. It is subsequently formulated as a Mixed Integer Linear Programming (MILP) problem. To evaluate the proposed method, real solar generation, load, BESS, and EV data are used in Sample Average Approximation (SAA). Computational results show the correctness of the proposed method as well as a steady and tightly bounded optimality gap. Comparisons demonstrate that the proposed stochastic method outperforms its deterministic counterpart at the expense of higher computational cost. It is also observed that a moderate number of EVs helps to reduce the overall operational cost of the MG, which sheds light on future EV integration into the smart grid.

### Yubo Wang, Bin Wang, Rui Huang, Chi-Cheng Chu, Hemanshu R. Pota, Rajit Gadh,"Two-Tier Prediction of Solar Power Generation with Limited Sensing Resource",2016 IEEE/PES Transmission and Distribution Conference and Exposition (T&D),Dallas, TX, USA,IEEE,July 25, 2016,1 - 5,doi: 10.1109/TDC.2016.7519968

This paper considers a typical solar installation scenario with limited sensing resources. In the literature, there exist either day-ahead solar generation prediction methods with limited accuracy, or high-accuracy short-timescale methods that are not suitable for applications requiring longer-term prediction. We propose a two-tier (global-tier and local-tier) prediction method to improve accuracy for long-term (24-hour) solar generation prediction using only historical power data. In the global tier, we examine two popular heuristic methods: weighted k-Nearest Neighbors (k-NN) and Neural Network (NN). In the local tier, the global-tier results are adaptively updated using real-time analytical residual analysis. The proposed method is validated using the UCLA Microgrid with 35 kW of solar generation capacity. Experimental results show that the proposed two-tier prediction method achieves higher accuracy than day-ahead predictions while providing the same prediction length. The difference in overall prediction performance between using weighted k-NN and using NN in the global tier is carefully discussed and reasoned. Case studies with a typical sunny day and a cloudy day are carried out to demonstrate the effectiveness of the proposed two-tier predictions.
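
As a rough illustration of the weighted k-NN idea used in the global tier, the sketch below predicts a profile as the inverse-distance-weighted average of the k most similar historical days. The feature choice, the inverse-distance weighting, and the name `weighted_knn_forecast` are assumptions for this example; the abstract does not specify the paper's actual features or weights.

```python
import math


def weighted_knn_forecast(history, query, k=3):
    """Predict a profile from the k historical days closest to `query`.

    history: list of (features, profile) pairs, e.g. the previous day's
             power readings as features and the following day's profile.
    query:   feature list for the day to be predicted.
    """
    # Euclidean distance from the query to every historical feature vector.
    scored = [(math.dist(features, query), profile) for features, profile in history]
    scored.sort(key=lambda pair: pair[0])
    nearest = scored[:k]

    # Closer neighbors get larger weights; the small constant avoids division by zero.
    weights = [1.0 / (dist + 1e-9) for dist, _ in nearest]
    total = sum(weights)

    horizon = len(nearest[0][1])
    return [
        sum(w * profile[h] for w, (_, profile) in zip(weights, nearest)) / total
        for h in range(horizon)
    ]


# Toy data: one feature per day and a two-step power profile (kW).
history = [([0.0], [10.0, 20.0]), ([2.0], [30.0, 40.0]), ([100.0], [0.0, 0.0])]
forecast = weighted_knn_forecast(history, [1.0], k=2)
```

Here the two nearest days are equally distant from the query, so the forecast is their plain average; the distant third day is ignored.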

### Nils E. R. Zimmermann, Maciej Haranczyk,"History and Utility of Zeolite Framework-Type Discovery from a Data-Science Perspective",Crystal Growth & Design,May 2, 2016,16:3043-3048,

Mature applications such as fluid catalytic cracking and hydrocracking rely critically on early zeolite structures. With a data-driven approach, we find that the discovery of exceptional zeolite framework types around the new millennium was spurred by exciting new utilization routes. The promising processes have not yet been successfully implemented (“valley of death” effect), mainly because of the lack of thermal stability of the crystals. This foreshadows limited deployability of recent zeolite discoveries that were achieved by novel crystal synthesis routes.

### George Michelogiannakis, John Shalf, David Donofrio, John Bachan,"Continuing the Scaling of Digital Computing Post Moore’s Law",LBNL report,April 2016,LBNL 1005126,

Traditional CMOS technology scaling, which up until now has followed Moore's law, is coming to an end in the next decade. However, the DOE has come to depend on the rapid, predictable, and cheap scaling of computing performance to meet mission needs for scientific theory, large-scale experiments, and national security. Moving forward, performance scaling of digital computing will need to originate from energy and cost reductions that result from novel architectures, devices, manufacturing technologies, and programming models. The deeper issue presented by these changes is the threat to DOE’s mission, to the future economic growth of the U.S. computing industry, and to society as a whole. With the impending end of Moore’s law, it is imperative for the Office of Advanced Scientific Computing Research (ASCR) to develop a balanced research agenda to assess the viability of novel semiconductor technologies and navigate the ensuing challenges. This report identifies four areas and research directions for ASCR and how each can be used to preserve performance scaling of digital computing beyond exascale and after Moore's law ends.

### J. R. Jones, F.-H. Rouet, K. V. Lawler, E. Vecharynski, K. Z. Ibrahim, S. Williams, B. Abeln, C. Yang, C. W. McCurdy, D. J. Haxton, X. S. Li, T. N. Rescigno,"An efficient basis set representation for calculating electrons in molecules",Journal of Molecular Physics,2016,doi: 10.1080/00268976.2016.1176262

The method of McCurdy, Baertschy, and Rescigno, J. Phys. B, 37, R137 (2004) is generalized to obtain a straightforward, surprisingly accurate, and scalable numerical representation for calculating the electronic wave functions of molecules. It uses a basis set of product sinc functions arrayed on a Cartesian grid, and yields 1 kcal/mol precision for valence transition energies with a grid resolution of approximately 0.1 bohr. The Coulomb matrix elements are replaced with matrix elements obtained from the kinetic energy operator. A resolution-of-the-identity approximation renders the primitive one- and two-electron matrix elements diagonal; in other words, the Coulomb operator is local with respect to the grid indices. The calculation of contracted two-electron matrix elements among orbitals requires only O(N log(N)) multiplication operations, not O(N^4), where N is the number of basis functions; N = n^3 on cubic grids. The representation not only is numerically expedient, but also produces energies and properties superior to those calculated variationally. Absolute energies, absorption cross sections, transition energies, and ionization potentials are reported for one- (He^+, H_2^+ ), two- (H_2, He), ten- (CH_4) and 56-electron (C_8H_8) systems.

### Bin Wang, Yubo Wang, Charlie Qiu, Chi-Cheng Chu, Rajit Gadh,"Event-based Electric Vehicle Scheduling Considering Random User Behaviors",2015 IEEE International Conference on Smart Grid Communications (SmartGridComm),Miami, FL, USA,IEEE,March 21, 2016,313-318,doi: 10.1109/SmartGridComm.2015.7436319

Uncontrolled Electric Vehicle (EV) and Plug-in Hybrid Electric Vehicle (PHEV) charging within a local distribution grid may cause unexpectedly high load, which in turn degrades power quality. However, coordinating the charging behaviors of a number of EVs is a challenging task, which involves not only deterministic schedule computation but also nondeterministic EV driver behaviors with random arrival times and energy demands. Previous research in this area rarely considers these random behaviors of real EV users. In this paper, an implementable event-based cost-optimal scheduling algorithm (ECSA) is developed, which solves the EV scheduling problem by dynamically estimating the stay duration and energy demand of each participating EV user. Datasets, including users' historical charging records and time series meter data collected from Electric Vehicle Supply Equipment (EVSEs) on the UCLA campus, are utilized for feature extraction. Based on these features, a suitable inference technique is employed to determine the parameters of each charging session. In addition, the integration of solar generation into EVSEs is considered in our problem formulation. The proposed approaches are tested and validated with real EV charging schedules of users on the UCLA campus. The simulation results demonstrate that the proposed algorithm performs better in cost minimization and load shifting than the existing equal-sharing scheduling algorithm (ESSA).

### E. Vecharynski, C. Yang, and F. Xue,"Generalized preconditioned locally harmonic residual method for non-Hermitian eigenproblems",SIAM Journal on Scientific Computing, Vol. 38, No. 1, pp. A500–A527,2016,doi: 10.1137/15M1027413

We introduce the Generalized Preconditioned Locally Harmonic Residual (GPLHR) method for solving standard and generalized non-Hermitian eigenproblems. The method is particularly useful for computing a subset of eigenvalues, and their eigen- or Schur vectors, closest to a given shift. The proposed method is based on block iterations and can take advantage of a preconditioner if it is available. It does not need to perform exact shift-and-invert transformation. Standard and generalized eigenproblems are handled in a unified framework. Our numerical experiments demonstrate that GPLHR is generally more robust and efficient than existing methods, especially if the available memory is limited.

### E. Vecharynski and C. Yang,"Preconditioned iterative methods for eigenvalue counts",to appear in Proceedings of International Workshop on Eigenvalue Problems: Algorithms, Software and Applications in Petascale Computing, in Lecture Notes in Computational Science and Engineering, Springer,2016,

We describe preconditioned iterative methods for estimating the number of eigenvalues of a Hermitian matrix within a given interval. Such estimation is useful in a number of applications. In particular, it can be used to develop an efficient spectrum-slicing strategy to compute many eigenpairs of a Hermitian matrix. Our method is based on Lanczos- and Arnoldi-type iterations. We show that with a properly defined preconditioner, only a few iterations may be needed to obtain a good estimate of the number of eigenvalues within a prescribed interval. We also demonstrate that the number of iterations required by the proposed preconditioned schemes is independent of the size and condition number of the matrix. The efficiency of the methods is illustrated on several problems arising from density functional theory based electronic structure calculations.

### Anastasiia Butko, Louisa Bessad, David Novo, Florent Bruguier, Abdoulaye Gamatié, Gilles Sassatelli, Lionel Torres, Michel Robert,"Position Paper: OpenMP scheduling on ARM big.LITTLE architecture",9th Int’l Workshop on Programmability and Architectures for Heterogeneous Multicores (MULTIPROG),Prague, Czech Republic,January 18, 2016,

Single-ISA heterogeneous multicore systems are emerging as a promising direction to achieve a more suitable balance between performance and energy consumption. However, proper utilization of these architectures is essential to realize the energy benefits. In this paper, we demonstrate the ineffectiveness of popular OpenMP scheduling policies when executing the Rodinia benchmark on the Exynos 5 Octa (5422) SoC, which integrates the ARM big.LITTLE architecture.

### E. Vecharynski,"A generalization of Saad's bound on harmonic Ritz vectors of Hermitian matrices",Linear Algebra and its Applications, Vol. 494, pp. 219-235,2016,doi: 10.1016/j.laa.2016.01.013

We prove a Saad-type bound for harmonic Ritz vectors of a Hermitian matrix. The new bound reveals a dependence of the harmonic Rayleigh-Ritz procedure on the condition number of a shifted problem operator. Several practical implications are discussed. In particular, the bound motivates the incorporation of preconditioning into the harmonic Rayleigh-Ritz scheme.