# 2017 Publications

### John Bachan, Dan Bonachea, Paul H Hargrove, Steve Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen, Scott B Baden,"The UPC++ PGAS library for Exascale Computing",Proceedings of the Second Annual PGAS Applications Workshop,November 13, 2017,7,

We describe UPC++ V1.0, a C++11 library that supports Asynchronous Partitioned Global Address Space (APGAS) programming. UPC++ targets distributed data structures where communication is irregular or fine-grained. The key abstractions are global pointers, asynchronous programming via RPC, and futures. Global pointers incorporate ownership information useful in optimizing for locality. Futures capture data readiness state, which is useful for scheduling, and enable the programmer to chain operations, via continuations, to execute asynchronously as high-latency dependencies become satisfied. The interfaces for moving non-contiguous data and handling memories with different optimal access methods are composable and closely resemble those used in modern C++. Communication in UPC++ runs at close to hardware speeds by utilizing the low-overhead GASNet-EX communication library.

### N. Sanderson, E. Shugerman, S. Molnar, J. Meiss, E. Bradley,"Computational Topology Techniques for Characterizing Time-Series Data",Advances in Intelligent Data Analysis XVI 16th International Symposium, IDA 2017, London, UK, October 26–28, 2017, Proceedings,October 2017,pp.284-296,doi: 10.1007/978-3-319-68765-0_24

Topological data analysis (TDA), while abstract, allows a characterization of time-series data obtained from nonlinear and complex dynamical systems. Though it is surprising that such an abstract measure of structure—counting pieces and holes—could be useful for real-world data, TDA lets us compare different systems, and even do membership testing or change-point detection. However, TDA is computationally expensive and involves a number of free parameters. This complexity can be obviated by coarse-graining, using a construct called the witness complex. The parametric dependence gives rise to the concept of persistent homology: how shape changes with scale. Its results allow us to distinguish time-series data from different systems—e.g., the same note played on different musical instruments.
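
As a rough illustration of "how shape changes with scale", the sketch below counts the connected components (the "pieces" in "counting pieces and holes") of a small point cloud as a distance threshold grows. It is a minimal sketch under stated assumptions, not the paper's witness-complex method: the function name `components_at_scale` and the sample points are hypothetical, and points are simply linked whenever their pairwise distance is within the threshold.

```python
import math
from itertools import combinations


def components_at_scale(points, eps):
    """Count connected components when points within distance eps are linked."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(points)), 2):
        if math.dist(points[i], points[j]) <= eps:
            parent[find(i)] = find(j)  # merge the two components

    return len({find(i) for i in range(len(points))})


# Hypothetical data: two pairs of nearby points, with the pairs far apart.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
counts = {eps: components_at_scale(points, eps) for eps in (0.5, 1.0, 10.0)}
print(counts)  # {0.5: 4, 1.0: 2, 10.0: 1}
```

The count stays at two over a wide range of thresholds, which is the kind of scale-persistent feature that persistent homology is meant to record.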

### John Bachan, Scott Baden, Dan Bonachea, Paul Hargrove, Steven Hofmeyr, Khaled Ibrahim, Mathias Jacquelin, Amir Kamil, Brian van Straalen,"UPC++ Programmer’s Guide, v1.0-2017.9",Lawrence Berkeley National Laboratory Tech Report,September 29, 2017,LBNL 2001065, doi: 10.2172/1398522

This document has been superseded by: UPC++ Programmer’s Guide, v1.0-2018.3.0 (LBNL-2001136)

UPC++ is a C++11 library that provides Asynchronous Partitioned Global Address Space (APGAS) programming. It is designed for writing parallel programs that run efficiently and scale well on distributed-memory parallel computers. The APGAS model is single program, multiple-data (SPMD), with each separate thread of execution (referred to as a rank, a term borrowed from MPI) having access to local memory as it would in C++. However, APGAS also provides access to a global address space, which is allocated in shared segments that are distributed over the ranks. UPC++ provides numerous methods for accessing and using global memory. In UPC++, all operations that access remote memory are explicit, which encourages programmers to be aware of the cost of communication and data movement. Moreover, all remote-memory access operations are by default asynchronous, to enable programmers to write code that scales well even on hundreds of thousands of cores.

### J Bachan, S Baden, D Bonachea, P Hargrove, S Hofmeyr, K Ibrahim, M Jacquelin, A Kamil, B Lelbach, B van Straalen,"UPC++ Specification v1.0, Draft 4",September 27, 2017,LBNL 2001066, doi: 10.2172/1398521

This document has been superseded by: UPC++ Specification v1.0, Draft 6 (LBNL-2001135)

UPC++ is a C++11 library providing classes and functions that support Asynchronous Partitioned Global Address Space (APGAS) programming. We are revising the library under the auspices of the DOE’s Exascale Computing Project, to meet the needs of applications requiring PGAS support. UPC++ is intended for implementing elaborate distributed data structures where communication is irregular or fine-grained. The UPC++ interfaces for moving non-contiguous data and handling memories with different optimal access methods are composable and similar to those used in conventional C++. The UPC++ programmer can expect communication to run at close to hardware speeds. The key facilities in UPC++ are global pointers, which enable the programmer to express ownership information for improving locality; one-sided communication, both put/get and RPC; and futures and continuations. Futures capture data readiness state, which is useful in making scheduling decisions, and continuations provide for completion handling via callbacks. Together, these enable the programmer to chain together a DAG of operations to execute asynchronously as high-latency dependencies become satisfied.
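
The idea of chaining continuations onto futures can be illustrated by analogy with Python's standard `concurrent.futures` module. This is only a sketch of the concept; it is not UPC++ code and does not reflect the UPC++ API. The helper `then` and the function `slow_fetch` are hypothetical names introduced for illustration, and a thread pool stands in for high-latency remote operations.

```python
from concurrent.futures import Future, ThreadPoolExecutor


def then(fut, fn):
    """Return a new future holding fn(result of fut), run when fut completes."""
    out = Future()

    def on_done(done):
        # Invoked as a callback once `done` is ready; forwards result or error.
        try:
            out.set_result(fn(done.result()))
        except Exception as exc:
            out.set_exception(exc)

    fut.add_done_callback(on_done)
    return out


def slow_fetch(x):
    # Hypothetical stand-in for a high-latency operation such as a remote get.
    return x * 2


with ThreadPoolExecutor(max_workers=1) as pool:
    fetched = pool.submit(slow_fetch, 10)             # becomes ready with 20
    plus_one = then(fetched, lambda v: v + 1)         # continuation: 21
    as_text = then(plus_one, lambda v: f"value={v}")  # continuation on a continuation
    result = as_text.result()                         # wait only at the end of the chain

print(result)  # value=21
```

Each continuation runs only after the future it depends on is ready, so the caller expresses the dependency chain up front and waits once, at the end.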

### Dan Bonachea, Paul Hargrove,"GASNet Specification, v1.8.1",Lawrence Berkeley National Laboratory Tech Report,August 31, 2017,LBNL 2001064, doi: 10.2172/1398512

GASNet is a language-independent, low-level networking layer that provides network-independent, high-performance communication primitives tailored for implementing parallel global address space SPMD languages and libraries such as UPC, UPC++, Co-Array Fortran, Legion, Chapel, and many others. The interface is primarily intended as a compilation target and for use by runtime library writers (as opposed to end users), and the primary goals are high performance, interface portability, and expressiveness. GASNet stands for "Global-Address Space Networking".

### Dongeun Lee, Alex Sim, Jaesik Choi, Kesheng Wu,"Improving Statistical Similarity Based Data Reduction for Non-Stationary Data",29th International Conference on Scientific and Statistical Database Management (SSDBM2017),2017,doi: 10.1145/3085504.3085583

Updated experiment version: https://sdm.lbl.gov/oapapers/ssdbm17-lee-upd.pdf
Original version: http://dl.acm.org/citation.cfm?doid=3085504.3085583

### Anastasiia Butko, Florent Bruguier, Abdoulaye Gamatié, Gilles Sassatelli,"Efficient Programming for Multicore Processor Heterogeneity: OpenMP versus OmpSs",Open Source Supercomputing Workshop,Frankfurt, Germany,Springer’s Lecture Notes in Computer Science (LNCS),June 22, 2017,

ARM single-ISA heterogeneous multicore processors combine high-performance big cores with power-efficient small cores. They aim at achieving a suitable balance between performance and energy. However, a main challenge is to program such architectures so as to efficiently exploit their features. In this paper, we study the impact on performance and energy trade-offs of single-ISA architecture according to OpenMP 3.0 and the OmpSs programming models. We consider different symmetric/asymmetric architecture configurations in terms of core frequency and core count between big and LITTLE clusters. Experiments are conducted on both a real Samsung Exynos 5 Octa system-on-chip and the gem5/McPAT simulation frameworks. Results show that OmpSs implementations are more sensitive to loop scheduling parameters than OpenMP 3.0. In most cases, best OmpSs configurations significantly outperform OpenMP ones. While cluster frequency asymmetry provides uninteresting results, asymmetric cluster configuration with single high-performance core and multiple low-power cores provides better performance/energy trade-offs in many cases.

### Dilip Vasudevan, Anastasiia Butko, George Michelogiannakis, David Donofrio, John Shalf,"Towards an Integrated Strategy to Preserve Digital Computing Performance Scaling Using Emerging Technologies",Workshop on HPC computing in a Post Moore’s law world (HCPM),June 22, 2017,

With the decline and eventual end of historical rates of lithographic scaling, we arrive at a crossroad where synergistic and holistic decisions are required to preserve Moore's law technology scaling. Numerous emerging technologies aim to extend digital electronics scaling of performance, energy efficiency, and computational power/density, ranging from devices (transistors), memories, 3D integration capabilities, specialized architectures, photonics, and others. The wide range of technology options creates the need for an integrated strategy to understand the impact of these emerging technologies on future large-scale digital systems for diverse application requirements and optimization metrics. In this paper, we argue for a comprehensive methodology that spans the different levels of abstraction -- from materials, to devices, to complex digital systems and applications. Our approach integrates compact models of low-level characteristics of the emerging technologies to inform higher-level simulation models to evaluate their responsiveness to application requirements. The integrated framework can then automate the search for an optimal architecture using available emerging technologies to maximize a targeted optimization metric.

### SM Martin, MJ Berger, SB Baden,"Toucan-A Translator for Communication Tolerant MPI Applications",Proceedings - 2017 IEEE 31st International Parallel and Distributed Processing Symposium, IPDPS 2017,June 2017,998-1007,doi: 10.1109/IPDPS.2017.44

We discuss early results with Toucan, a source-to-source translator that automatically restructures C/C++ MPI applications to overlap communication with computation. We co-designed the translator and runtime system to enable dynamic, dependence-driven execution of MPI applications, and require only a modest amount of programmer annotation. Co-design was essential to realizing overlap through dynamic code block reordering and avoiding the limitations of static code relocation and inlining. We demonstrate that Toucan hides significant communication in four representative applications running on up to 24K cores of NERSC's Edison platform. Using Toucan, we have hidden from 33% to 85% of the communication overhead, with performance meeting or exceeding that of painstakingly hand-written overlap variants. © 2017 IEEE.

### Dai Wang, Junyu Gao, Pan Li, Bin Wang, Cong Zhang, Samveg Saxena,"Modeling of plug-in electric vehicle travel patterns and charging load based on trip chain generation",Journal of Power Sources,May 13, 2017,359:468 - 479,doi: 10.1016/j.jpowsour.2017.05.036

Modeling plug-in electric vehicle (PEV) travel and charging behavior is key to estimating charging demand and further exploring the potential of providing grid services. This paper presents a stochastic simulation methodology to generate itineraries and charging load profiles for a population of PEVs based on real-world vehicle driving data. To describe the sequence of daily travel activities, we use the trip chain model, which contains the detailed information of each trip, namely start time, end time, trip distance, start location and end location. A trip chain generation method is developed based on the Naive Bayes model to generate a large number of trips that are temporally and spatially coupled. We apply the proposed methodology to investigate the multi-location charging loads in three different scenarios. Simulation results show that home charging can meet the energy demand of the majority of PEVs under average conditions. In addition, we calculate the lower bound of the charging load peak on the premise of lowest charging cost. The results are instructive for the design and construction of charging facilities to avoid excessive infrastructure.

### Yingqi Xiong, Bin Wang, Chi-cheng Chu, Rajit Gadh,"Distributed Optimal Vehicle Grid Integration Strategy with User Behavior Prediction",IEEE PES General Meeting 2017,March 13, 2017,

With the increase in electric vehicle (EV) adoption in recent years, the impact of EV charging on the power grid has become increasingly significant. In this article, an optimal scheduling algorithm that combines smart EV charging and vehicle-to-grid (V2G) grid service is developed to integrate EVs into the power grid as distributed energy resources, with improved system cost performance. Specifically, an optimization problem is formulated and solved at each EV charging station according to the control signal from an aggregated control center and user charging behavior predicted by mean estimation and linear regression. The control center collects the distributed optimization results and periodically updates the control signal. The iteration continues until it converges to an optimal schedule. Experimental results show that this algorithm helps fill the valleys and shave the peaks in electric load profiles within a microgrid, while still satisfying the energy demand of individual drivers.

### Yubo Wan, Wenbo Shi, Bin Wang, Chi-Cheng Ch, Rajit Gadh,"Optimal operation of stationary and mobile batteries in distribution grids",Applied Energy,January 28, 2017,190:1289 - 130,doi: 10.1016/j.apenergy.2016.12.139

The growing integration of Battery Energy Storage Systems (BESS, stationary batteries) and Electric Vehicles (EVs, mobile batteries) into distribution grids calls for advanced Demand Side Management (DSM) techniques that address the scalability concerns of the system and the stochastic availability of EVs. Towards this goal, a stochastic DSM is proposed to capture the uncertainties in EVs. Numerical approximation is then used to make the problem tractable. To accelerate computation, the proposed DSM is tightly relaxed to a convex form using second-order cone programming. Furthermore, in light of the continuously increasing problem size, a distributed method with guaranteed convergence is applied to shift the centralized computational burden to distributed controllers. To verify the proposed DSM, real-life EV data collected on the UCLA campus is used to test it in an IEEE benchmark test system. Numerical results demonstrate the correctness and merits of the proposed approach.