# Publications

### Dongeun Lee, Alex Sim, Jaesik Choi, Kesheng Wu, "Improving Statistical Similarity Based Data Reduction for Non-Stationary Data", 29th International Conference on Scientific and Statistical Database Management (SSDBM 2017), 2017, doi: 10.1145/3085504.3085583

Updated experiment version: https://sdm.lbl.gov/oapapers/ssdbm17-lee-upd.pdf
Original version: http://dl.acm.org/citation.cfm?doid=3085504.3085583

### Dilip Vasudevan, Anastasiia Butko, George Michelogiannakis, David Donofrio, John Shalf, "Towards an Integrated Strategy to Preserve Digital Computing Performance Scaling Using Emerging Technologies", Workshop on HPC Computing in a Post Moore’s Law World (HCPM), June 22, 2017

With the decline and eventual end of historical rates of lithographic scaling, we arrive at a crossroad where synergistic and holistic decisions are required to preserve Moore's law technology scaling. Numerous emerging technologies aim to extend digital electronics scaling of performance, energy efficiency, and computational power/density, ranging from devices (transistors) and memories to 3D integration capabilities, specialized architectures, photonics, and others.
The wide range of technology options creates the need for an integrated strategy to understand the impact of these emerging technologies on future large-scale digital systems for diverse application requirements and optimization metrics.
In this paper, we argue for a comprehensive methodology that spans the different levels of abstraction -- from materials, to devices, to complex digital systems and applications. Our approach integrates compact models of low-level characteristics of the emerging technologies to inform higher-level simulation models to evaluate their responsiveness to application requirements.
The integrated framework can then automate the search for an optimal architecture using available emerging technologies to maximize a targeted optimization metric.

### E. Vecharynski, A. Knyazev, "Preconditioned steepest descent-like methods for symmetric indefinite systems", Linear Algebra and its Applications, Vol. 511, pp. 274–295, 2016

We construct preconditioned steepest descent (PSD)-like methods for iterative solution of symmetric indefinite linear systems using symmetric and positive definite (SPD) preconditioners. Our construction is based on a locally optimal residual minimization over two-dimensional subspaces, mathematically equivalent in exact arithmetic to preconditioned MINRES (PMINRES) restarted after every two steps. A convergence bound is derived. If certain information on the spectrum of the preconditioned system is available, we present a simpler PSD-like algorithm that performs only one-dimensional residual minimization. Search direction randomization for accelerating this algorithm is discussed. Our primary goal is to bridge the theoretical gap between the optimal (PMINRES) and PSD-like methods for solving symmetric indefinite systems. We also demonstrate situations where the suggested PSD-like schemes can be preferable to the optimal PMINRES iteration.
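
A minimal numpy sketch of the locally optimal residual minimization over two-dimensional subspaces described above (equivalent in exact arithmetic to PMINRES restarted every two steps). The test matrix, identity preconditioner, and iteration counts are illustrative assumptions, not the paper's setup:

```python
import numpy as np

def psd_like_solve(A, b, M_inv=None, maxiter=200, tol=1e-10):
    """Locally optimal residual minimization over 2-D subspaces
    (a sketch of a PSD-like iteration with an SPD preconditioner)."""
    n = len(b)
    if M_inv is None:
        M_inv = np.eye(n)                 # SPD preconditioner (identity here)
    x = np.zeros(n)
    r = b - A @ x
    for _ in range(maxiter):
        s1 = M_inv @ r                    # preconditioned residual
        s2 = M_inv @ (A @ s1)             # second search direction
        S = np.column_stack([s1, s2])
        c, *_ = np.linalg.lstsq(A @ S, r, rcond=None)  # min ||r - A S c||
        x += S @ c
        r = b - A @ x
        if np.linalg.norm(r) < tol * np.linalg.norm(b):
            break
    return x, r

# symmetric indefinite test problem: eigenvalues in [-4,-1] and [1,4]
rng = np.random.default_rng(0)
d = np.concatenate([np.linspace(-4, -1, 25), np.linspace(1, 4, 25)])
Q, _ = np.linalg.qr(rng.standard_normal((50, 50)))
A = Q @ np.diag(d) @ Q.T
b = rng.standard_normal(50)
x, r = psd_like_solve(A, b)
```

Because each step minimizes the residual over a two-dimensional subspace, a degree-2 residual polynomial is available per cycle, which is what allows convergence even though the spectrum straddles zero.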

### S. V. Venkatakrishnan, Jeffrey Donatelli, Dinesh Kumar, Abhinav Sarje, Sunil K. Sinha, Xiaoye S. Li, Alexander Hexemer, "A Multi-slice Simulation Algorithm for Grazing-Incidence Small-Angle X-ray Scattering", Journal of Applied Crystallography, Vol. 49, No. 6, December 2016, doi: 10.1107/S1600576716013273

Grazing-incidence small-angle X-ray scattering (GISAXS) is an important technique in the characterization of samples at the nanometre scale. A key aspect of GISAXS data analysis is the accurate simulation of samples to match the measurement. The distorted-wave Born approximation (DWBA) is a widely used model for the simulation of GISAXS patterns. For certain classes of sample such as nanostructures embedded in thin films, where the electric field intensity variation is significant relative to the size of the structures, a multi-slice DWBA theory is more accurate than the conventional DWBA method. However, simulating complex structures in the multi-slice setting is challenging and the algorithms typically used are designed on a case-by-case basis depending on the structure to be simulated. In this paper, an accurate algorithm for GISAXS simulations based on the multi-slice DWBA theory is presented. In particular, fundamental properties of the Fourier transform have been utilized to develop an algorithm that accurately computes the average refractive index profile as a function of depth and the Fourier transform of the portion of the sample within a given slice, which are key quantities required for the multi-slice DWBA simulation. The results from this method are compared with the traditionally used approximations, demonstrating that the proposed algorithm can produce more accurate results. Furthermore, this algorithm is general with respect to the sample structure, and does not require any sample-specific approximations to perform the simulations.

Best Paper Award

### R. Li, Y. Xi, E. Vecharynski, C. Yang, and Y. Saad, "A Thick-Restart Lanczos algorithm with polynomial filtering for Hermitian eigenvalue problems", SIAM Journal on Scientific Computing, Vol. 38, Issue 4, pp. A2512–A2534, 2016, doi: 10.1137/15M1054493

Polynomial filtering can provide a highly effective means of computing all eigenvalues of a real symmetric (or complex Hermitian) matrix that are located in a given interval, anywhere in the spectrum. This paper describes a technique for tackling this problem by combining a Thick-Restart version of the Lanczos algorithm with deflation ('locking') and a new type of polynomial filters obtained from a least-squares technique. The resulting algorithm can be utilized in a 'spectrum-slicing' approach whereby a very large number of eigenvalues and associated eigenvectors of the matrix are computed by extracting eigenpairs located in different sub-intervals independently from one another.
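
The least-squares filter idea can be illustrated without the full Lanczos machinery: fit a polynomial to the indicator function of the target interval and check that it amplifies eigenvalues inside the interval while damping the rest. The interval, degree, and sample eigenvalues below are arbitrary illustrative choices, not the paper's construction:

```python
import numpy as np
from numpy.polynomial import chebyshev as C

# least-squares Chebyshev filter approximating the indicator of [a, b]
a, b, deg = -0.3, 0.3, 80
xs = np.linspace(-1, 1, 2001)            # assume spectrum scaled into [-1, 1]
target = ((xs >= a) & (xs <= b)).astype(float)
coef = C.chebfit(xs, target, deg)        # discrete least-squares fit

# For a Hermitian A with spectrum in [-1, 1], p(A) pushes the wanted
# eigenvalues toward 1 and the unwanted ones toward 0; evaluating the
# filter at each eigenvalue shows the same separation.
eigs = np.array([-0.9, -0.6, 0.0, 0.15, 0.7, 0.95])
filtered = C.chebval(eigs, coef)
inside = (eigs >= a) & (eigs <= b)
```

Running Lanczos on the filtered operator p(A) then makes the eigenpairs in [a, b] dominant, which is the mechanism behind the spectrum-slicing approach.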

### Nils E. R. Zimmermann, Maciej Haranczyk, "History and Utility of Zeolite Framework-Type Discovery from a Data-Science Perspective", Crystal Growth & Design, May 2, 2016

Mature applications such as fluid catalytic cracking and hydrocracking rely critically on early zeolite structures. With a data-driven approach, we find that the discovery of exceptional zeolite framework types around the new millennium was spurred by exciting new utilization routes. The promising processes have not yet been successfully implemented (the “valley of death” effect), mainly because the crystals lack thermal stability. This foreshadows limited deployability of recent zeolite discoveries that were achieved by novel crystal synthesis routes.


### George Michelogiannakis, John Shalf, David Donofrio, John Bachan, "Continuing the Scaling of Digital Computing Post Moore’s Law", LBNL report LBNL-1005126, April 2016

Traditional CMOS technology scaling, which up until now has followed Moore's law, will come to an end in the next decade. However, the DOE has come to depend on the rapid, predictable, and cheap scaling of computing performance to meet mission needs for scientific theory, large scale experiments, and national security. Moving forward, performance scaling of digital computing will need to originate from energy and cost reductions that are a result of novel architectures, devices, manufacturing technologies, and programming models. The deeper issue presented by these changes is the threat to DOE’s mission, to the future economic growth of the U.S. computing industry, and to society as a whole. With the impending end of Moore’s law, it is imperative for the Office of Advanced Scientific Computing Research (ASCR) to develop a balanced research agenda to assess the viability of novel semiconductor technologies and navigate the ensuing challenges. This report identifies four areas and research directions for ASCR and how each can be used to preserve performance scaling of digital computing beyond exascale and after Moore's law ends.

### J. R. Jones, F.-H. Rouet, K. V. Lawler, E. Vecharynski, K. Z. Ibrahim, S. Williams, B. Abeln, C. Yang, C. W. McCurdy, D. J. Haxton, X. S. Li, T. N. Rescigno, "An efficient basis set representation for calculating electrons in molecules", Molecular Physics, 2016, doi: 10.1080/00268976.2016.1176262

The method of McCurdy, Baertschy, and Rescigno, J. Phys. B, 37, R137 (2004) is generalized to obtain a straightforward, surprisingly accurate, and scalable numerical representation for calculating the electronic wave functions of molecules. It uses a basis set of product sinc functions arrayed on a Cartesian grid, and yields 1 kcal/mol precision for valence transition energies with a grid resolution of approximately 0.1 bohr. The Coulomb matrix elements are replaced with matrix elements obtained from the kinetic energy operator. A resolution-of-the-identity approximation renders the primitive one- and two-electron matrix elements diagonal; in other words, the Coulomb operator is local with respect to the grid indices. The calculation of contracted two-electron matrix elements among orbitals requires only O(N log(N)) multiplication operations, not O(N^4), where N is the number of basis functions; N = n^3 on cubic grids. The representation not only is numerically expedient, but also produces energies and properties superior to those calculated variationally. Absolute energies, absorption cross sections, transition energies, and ionization potentials are reported for one- (He^+, H_2^+ ), two- (H_2, He), ten- (CH_4) and 56-electron (C_8H_8) systems.
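
A one-dimensional sketch of a sinc (DVR) basis on a uniform Cartesian grid, using the standard Colbert-Miller kinetic-energy matrix elements rather than the paper's full molecular machinery; the harmonic oscillator potential is an illustrative assumption, chosen because its exact levels (0.5, 1.5, ...) make the exponential accuracy of the sinc grid easy to verify:

```python
import numpy as np

# Sinc-basis (DVR) Hamiltonian on a uniform grid, 1-D harmonic oscillator
# V(x) = x^2/2 with hbar = m = 1; grid spacing h, points x_i = i*h.
h = 0.2
i = np.arange(-40, 41)                    # grid covers [-8, 8]
x = i * h
d = i[:, None] - i[None, :]
with np.errstate(divide="ignore"):
    T = (-1.0) ** d / (h ** 2 * d ** 2.0)  # off-diagonal kinetic elements
np.fill_diagonal(T, np.pi ** 2 / (6 * h ** 2))  # diagonal kinetic elements
H = T + np.diag(0.5 * x ** 2)
E = np.linalg.eigvalsh(H)                 # lowest levels approach 0.5, 1.5, ...
```

Note how the kinetic matrix is dense but analytic, while the potential is diagonal on the grid; this locality of operators with respect to grid indices is the property the abstract exploits for fast contracted matrix elements.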


### E. Vecharynski, C. Yang, and F. Xue, "Generalized preconditioned locally harmonic residual method for non-Hermitian eigenproblems", SIAM Journal on Scientific Computing, Vol. 38, No. 1, pp. A500–A527, 2016, doi: 10.1137/15M1027413

We introduce the Generalized Preconditioned Locally Harmonic Residual (GPLHR) method for solving standard and generalized non-Hermitian eigenproblems. The method is particularly useful for computing a subset of eigenvalues, and their eigen- or Schur vectors, closest to a given shift. The proposed method is based on block iterations and can take advantage of a preconditioner if it is available. It does not need to perform exact shift-and-invert transformation. Standard and generalized eigenproblems are handled in a unified framework. Our numerical experiments demonstrate that GPLHR is generally more robust and efficient than existing methods, especially if the available memory is limited.

### E. Vecharynski and C. Yang, "Preconditioned iterative methods for eigenvalue counts", to appear in Proceedings of the International Workshop on Eigenvalue Problems: Algorithms, Software and Applications in Petascale Computing, Lecture Notes in Computational Science and Engineering, Springer, 2016

We describe preconditioned iterative methods for estimating the number of eigenvalues of a Hermitian matrix within a given interval. Such estimation is useful in a number of applications. In particular, it can be used to develop an efficient spectrum-slicing strategy to compute many eigenpairs of a Hermitian matrix. Our methods are based on Lanczos- and Arnoldi-type iterations. We show that with a properly defined preconditioner, only a few iterations may be needed to obtain a good estimate of the number of eigenvalues within a prescribed interval. We also demonstrate that the number of iterations required by the proposed preconditioned schemes is independent of the size and condition number of the matrix. The efficiency of the methods is illustrated on several problems arising from density functional theory based electronic structure calculations.
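
The basic counting task can be sketched with a polynomial filter plus Hutchinson's stochastic trace estimator — a generic, unpreconditioned combination, not the paper's scheme; the test matrix, interval, degree, and sample counts are illustrative assumptions:

```python
import numpy as np
from numpy.polynomial import chebyshev as C

def count_estimate(A, a, b, deg=80, n_vec=200, seed=0):
    """Estimate the number of eigenvalues of Hermitian A (spectrum in
    [-1, 1]) inside [a, b] as tr(p(A)), where p is a Chebyshev
    least-squares fit of the indicator of [a, b], and the trace is
    estimated with Rademacher probe vectors (Hutchinson)."""
    xs = np.linspace(-1, 1, 2001)
    coef = C.chebfit(xs, ((xs >= a) & (xs <= b)).astype(float), deg)
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    total = 0.0
    for _ in range(n_vec):
        v = rng.choice([-1.0, 1.0], size=n)      # random +/-1 probe
        t0, t1 = v, A @ v                        # three-term Chebyshev
        w = coef[0] * t0 + coef[1] * t1          # recurrence for p(A) v
        for c in coef[2:]:
            t0, t1 = t1, 2.0 * (A @ t1) - t0
            w += c * t1
        total += v @ w
    return total / n_vec

# test matrix with exactly 40 eigenvalues inside [-0.3, 0.3]
rng = np.random.default_rng(1)
d = np.concatenate([np.linspace(-1, -0.55, 30),
                    np.linspace(-0.2, 0.2, 40),
                    np.linspace(0.55, 1, 30)])
Q, _ = np.linalg.qr(rng.standard_normal((100, 100)))
A = Q @ np.diag(d) @ Q.T
est = count_estimate(A, -0.3, 0.3)
```

Only matrix-vector products with A are needed, which is why such estimators scale to the large sparse matrices targeted by the paper.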

### E. Vecharynski, "A generalization of Saad's bound on harmonic Ritz vectors of Hermitian matrices", Linear Algebra and its Applications, Vol. 494, pp. 219–235, 2016, doi: 10.1016/j.laa.2016.01.013

We prove a Saad-type bound for harmonic Ritz vectors of a Hermitian matrix. The new bound reveals a dependence of the harmonic Rayleigh-Ritz procedure on the condition number of a shifted problem operator. Several practical implications are discussed. In particular, the bound motivates incorporation of preconditioning into the harmonic Rayleigh-Ritz scheme.

### D. B. Szyld, E. Vecharynski, and F. Xue, "Preconditioned eigensolvers for large-scale nonlinear Hermitian eigenproblems with variational characterizations. II. Interior eigenvalues", SIAM Journal on Scientific Computing, Vol. 37, Issue 6, pp. A2969–A2997, 2015

We consider the solution of large-scale nonlinear algebraic Hermitian eigenproblems of the form $T(\lambda)v=0$ that admit a variational characterization of eigenvalues. These problems arise in a variety of applications and are generalizations of linear Hermitian eigenproblems $Av\!=\!\lambda Bv$. In this paper, we propose a Preconditioned Locally Minimal Residual (PLMR) method for efficiently computing interior eigenvalues of problems of this type. We discuss the development of search subspaces, preconditioning, and eigenpair extraction procedure based on the refined Rayleigh-Ritz projection. Extension to the block methods is presented, and a moving-window style soft deflation is described. Numerical experiments demonstrate that PLMR methods provide a rapid and robust convergence towards interior eigenvalues. The approach is also shown to be efficient and reliable for computing a large number of extreme eigenvalues, dramatically outperforming standard preconditioned conjugate gradient methods.

### Abhinav Sarje, Xiaoye S. Li, Slim Chourou, Dinesh Kumar, Singanallur Venkatakrishnan, Alexander Hexemer, "Inverse Modeling Nanostructures from X-Ray Scattering Data through Massive Parallelism", Supercomputing (SC'15), November 2015

We consider the problem of reconstructing material nanostructures from grazing-incidence small-angle X-ray scattering (GISAXS) data obtained through experiments at synchrotron light sources. This is an important tool for the characterization of macromolecules and nanoparticle systems, with applications such as the design of energy-relevant nano-devices. Computational analysis of experimentally collected scattering data has been the primary bottleneck in this process.
We exploit the availability of massive parallelism in leadership-class supercomputers with multi-core and graphics processors to realize the compute-intensive reconstruction process. To develop a solution, we employ various optimization algorithms including gradient-based LMVM, derivative-free trust region-based POUNDerS, and particle swarm optimization, and apply these in a massively parallel fashion.
We compare their performance in terms of both quality of solution and computational speed. We demonstrate the effective utilization of up to 8,000 GPU nodes of the Titan supercomputer for inverse modeling of organic-photovoltaics (OPVs) in less than 15 minutes.
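
As a toy illustration of one of the optimizers mentioned above, here is a minimal serial particle swarm optimization loop on a quadratic test objective; the coefficients, swarm size, and objective are conventional textbook defaults, not the values or fitting problem used in the paper's massively parallel runs:

```python
import numpy as np

def pso(f, dim=2, n_particles=30, iters=300, seed=0):
    """Minimal particle swarm optimizer: each particle is pulled toward
    its own best position and the swarm-wide best position."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-5, 5, (n_particles, dim))   # positions
    v = np.zeros_like(x)                         # velocities
    pbest = x.copy()                             # per-particle best points
    pval = np.apply_along_axis(f, 1, x)          # per-particle best values
    g = pbest[np.argmin(pval)].copy()            # swarm-wide best point
    w, c1, c2 = 0.7, 1.5, 1.5                    # inertia / pull weights
    for _ in range(iters):
        r1 = rng.random((n_particles, dim))
        r2 = rng.random((n_particles, dim))
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)
        x = x + v
        fx = np.apply_along_axis(f, 1, x)
        better = fx < pval
        pbest[better], pval[better] = x[better], fx[better]
        g = pbest[np.argmin(pval)].copy()
    return g, pval.min()

best_x, best_f = pso(lambda p: np.sum(p ** 2))   # sphere test objective
```

In the parallel setting of the paper, the expensive step is each particle's objective evaluation (a full forward scattering simulation), which is what makes the swarm's independent evaluations map well onto thousands of GPU nodes.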

### E. Vecharynski, J. Brabec, M. Shao, N. Govind, C. Yang, "Efficient Block Preconditioned Eigensolvers for Linear Response Time-dependent Density Functional Theory", submitted to JCC, 2015

We present two efficient iterative algorithms for solving the linear response eigenvalue problem arising from time-dependent density functional theory. Although the matrix to be diagonalized is nonsymmetric, it has a special structure that can be exploited to save both memory and floating point operations. In particular, the nonsymmetric eigenvalue problem can be transformed into a product eigenvalue problem that is self-adjoint with respect to a K-inner product. This product eigenvalue problem can be solved efficiently by a modified Davidson algorithm and a modified locally optimal block preconditioned conjugate gradient (LOBPCG) algorithm that make use of the K-inner product. The solution of the product eigenvalue problem yields one component of the eigenvector associated with the original eigenvalue problem. However, the other component of the eigenvector can be easily recovered in a postprocessing procedure. Therefore, the algorithms we present here are more efficient than existing algorithms that try to approximate both components of the eigenvectors simultaneously. The efficiency of the new algorithms is demonstrated by numerical examples.
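
The structure being exploited can be seen on a small dense example: for symmetric A and B with A+B and A-B positive definite, the eigenvalues of the full linear-response matrix [[A, B], [-B, -A]] come in pairs ±ω, and the ω² are exactly the eigenvalues of the half-size product (A-B)(A+B). A numpy sketch with random illustrative matrices (the construction and sizes are assumptions, not the paper's test problems):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6

def rand_spd(n):
    """Random symmetric positive definite matrix."""
    R = rng.standard_normal((n, n))
    return R @ R.T + n * np.eye(n)

S1, S2 = rand_spd(n), rand_spd(n)        # S1 = A+B, S2 = A-B, both SPD
A, B = (S1 + S2) / 2, (S1 - S2) / 2

# full 2n x 2n linear-response eigenproblem: eigenvalues come in pairs +/-w
H = np.block([[A, B], [-B, -A]])
full = np.linalg.eigvals(H)
pos = np.sort(full[full.real > 0].real)  # the n positive excitation energies

# half-size product problem: (A-B)(A+B) z = w^2 z
prod = np.sort(np.sqrt(np.linalg.eigvals(S2 @ S1).real))
```

Working with the half-size product (and recovering the second eigenvector component afterwards) is the memory and flop saving the abstract refers to.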

### E. Vecharynski, A. Knyazev, "Preconditioned Locally Harmonic Residual Method for computing interior eigenpairs of certain classes of Hermitian matrices", SIAM Journal on Scientific Computing, Vol. 37, Issue 5, pp. S3–S29, 2015

We propose a Preconditioned Locally Harmonic Residual (PLHR) method for computing several interior eigenpairs of a generalized Hermitian eigenvalue problem, without traditional spectral transformations, matrix factorizations, or inversions. PLHR is based on a short-term recurrence, easily extended to a block form, computing eigenpairs simultaneously. PLHR can take advantage of Hermitian positive definite preconditioning, e.g., based on an approximate inverse of an absolute value of a shifted matrix, introduced in [SISC, 35 (2013), pp. A696–A718]. Our numerical experiments demonstrate that PLHR is efficient and robust for certain classes of large-scale interior eigenvalue problems, involving Laplacian and Hamiltonian operators, especially if memory requirements are tight.

### Jason Adams, Monica Lieng, Brooks Kuhn, Edward Guo, Edik Simonian, Sean Peisert, JP Delplanque, Nick Anderson, "Automated Mechanical Ventilator Waveform Analysis of Patient-Ventilator Asynchrony", CHEST Journal, p. 175A, October 2015, doi: 10.1378/chest.2281731

PURPOSE: Mechanical ventilation is a life-saving intervention but is associated with adverse effects including ventilator-induced lung injury (VILI). Patient-ventilator asynchrony (PVA) is thought to contribute to VILI, but the study of PVA has been hampered by limited access to the high frequency, large volume data streams produced by modern ventilators and a lack of robust analytics. To address these limitations, we developed an automated pipeline for breath-by-breath analysis of ventilator waveform data.

METHODS: Simulated pressure and flow time series data representing normal breaths and common forms of PVA were generated on PB840 ventilators, collected unobtrusively using small, customized wireless peripheral devices, and transmitted to a networked server for storage and analysis. Two critical care physicians reviewed all waveforms to generate gold standards. Rule-based algorithms were developed to quantify inspiratory and expiratory tidal volumes (TV) and identify PVA subtypes including double trigger and delayed termination asynchrony. Data were split randomly into derivation and validation sets. Algorithm performance was compared to ventilator reported values and clinician annotation.

RESULTS: The mean difference between algorithm-determined and ventilator-reported TVs was 3.1% (99% CI ± 1.36%). Algorithm agreement with clinician annotation was excellent for double trigger PVA and moderate for delayed termination PVA, with Kappa statistics of 0.85 and 0.58, respectively. In the validation data set (n = 492 breaths), double trigger asynchrony was detected with an overall accuracy of 94.1%, sensitivity of 100%, and specificity of 92.8%.
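
For reference, the Kappa statistic quoted above is Cohen's kappa, which corrects raw rater agreement for agreement expected by chance; a minimal generic implementation (not the study's code):

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length label sequences:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(rater_a)
    p_obs = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    ca, cb = Counter(rater_a), Counter(rater_b)
    p_exp = sum(ca[k] * cb[k] for k in ca) / n ** 2  # chance agreement
    return (p_obs - p_exp) / (1 - p_exp)
```

A kappa of 0.85 (double trigger) indicates near-perfect agreement, while 0.58 (delayed termination) is moderate, matching the qualitative description in the results.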

CONCLUSIONS: A pipeline combining wireless ventilator data acquisition and rule-based analytic algorithms informed by the principles of bedside ventilator waveform analysis allows for automated, quantitative breath-by-breath analysis of patient-ventilator interactions.

CLINICAL IMPLICATIONS: We have recently deployed this system in the medical intensive care unit of the UC Davis Medical Center, which will enable further development of mechanical ventilation analytics. We have begun to explore the use of supervised machine learning and dynamic time series modeling to improve the classification of other common types of PVA and of clinical phenotypes associated with respiratory failure. This system will help to better define the epidemiology and clinical impact of PVA and other forms of off-target mechanical ventilation, and may lead to improved decision support and patient outcomes.

### Tobias Titze, Alexander Lauerer, Lars Heinke, Christian Chmelik, Nils E. R. Zimmermann, Frerich J. Keil, Douglas M. Ruthven, Jörg Kärger, "Transport in Nanoporous Materials Including MOFs: The Applicability of Fick’s Laws", Angew. Chem. Int. Ed., 2015, doi: 10.1002/anie.201506954

Diffusion in nanoporous host–guest systems is often considered to be too complicated to comply with such “simple” relationships as Fick’s first and second law of diffusion. However, it is shown herein that the microscopic techniques of diffusion measurement, notably the pulsed field gradient (PFG) technique of NMR spectroscopy and microimaging by interference microscopy (IFM) and IR microscopy (IRM), provide direct experimental evidence of the applicability of Fick’s laws to such systems. This remains true in many situations, even when the detailed mechanism is complex. The limitations of the diffusion model are also discussed with reference to the extensive literature on this subject.

### Daniel R. Burke, "Neuromorphic Hardware for HPC", conference presentation, October 7, 2015

Presented at Simons Institute Theory of Neural Computation Workshop.

### Nils E. R. Zimmermann, Bart Vorselaars, David Quigley, Baron Peters, "Nucleation of NaCl from Aqueous Solution: Critical Sizes, Ion-Attachment Kinetics, and Rates", J. Am. Chem. Soc., 2015, doi: 10.1021/jacs.5b08098

Nucleation and crystal growth are important in material synthesis, climate modeling, biomineralization, and pharmaceutical formulation. Despite tremendous efforts, the mechanisms and kinetics of nucleation remain elusive to both theory and experiment. Here we investigate sodium chloride (NaCl) nucleation from supersaturated brines using seeded atomistic simulations, polymorph-specific order parameters, and elements of classical nucleation theory. We find that NaCl nucleates via the common rock salt structure. Ion desolvation—not diffusion—is identified as the limiting resistance to attachment. Two different analyses give approximately consistent attachment kinetics: diffusion along the nucleus size coordinate and reaction-diffusion analysis of approach-to-coexistence simulation data from Aragones et al. (J. Chem. Phys. 2012, 136, 244508). Our simulations were performed at realistic supersaturations to enable the first direct comparison to experimental nucleation rates for this system. The computed and measured rates converge to a common upper limit at extremely high supersaturation. However, our rate predictions are between 15 and 30 orders of magnitude too fast. We comment on possible origins of the large discrepancy.

Watch a movie illustrating our seeded simulation strategy here.

### Nathan Hanford, Vishal Ahuja, Mehmet Balman, Matthew Farrens, Dipak Ghosal, Eric Pouyoul, Brian Tierney, "Improving Network Performance on Multicore Systems: Impact of Core Affinities on High Throughput Flows", The International Journal of eScience, Elsevier, 2015, doi: 10.1016/j.future.2015.09.012

Network throughput is scaling up to higher data rates while end-system processors are scaling out to multiple cores. In order to optimize high speed data transfer into multicore end-systems, techniques such as network adaptor offloads and performance tuning have received a great deal of attention. Furthermore, several methods of multi-threading the network receive process have been proposed. However, thus far attention has been focused on how to set the tuning parameters and which offloads to select for higher performance, and little has been done to understand why the various parameter settings do (or do not) work. In this paper, we build on previous research to track down the sources of the end-system bottleneck for high-speed TCP flows. We define protocol processing efficiency to be the amount of system resources (such as CPU and cache) used per unit of achieved throughput (in Gbps). The amounts of the various system resources consumed are measured using low-level system event counters. In a multicore end-system, affinitization, or core binding, is the decision regarding how the various tasks of the network receive process, including interrupt, network, and application processing, are assigned to the different processor cores. We conclude that affinitization has a significant impact on protocol processing efficiency, and that the performance bottleneck of the network receive process changes significantly with different affinitizations.
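
Affinitization as discussed here is done through the OS scheduler; below is a Linux-specific Python sketch of core binding for the current process (illustrative only — the paper's experiments bind interrupt, network, and application processing to cores separately, which requires more than a single process-level call):

```python
import os

# Core binding sketch (Linux-only: os.sched_*affinity is not available on
# all platforms). Pin the current process to CPU 0, then restore.
if hasattr(os, "sched_setaffinity"):
    original = os.sched_getaffinity(0)   # current allowed-core set
    os.sched_setaffinity(0, {0})         # bind this process to CPU 0 only
    pinned = os.sched_getaffinity(0)     # -> {0}
    os.sched_setaffinity(0, original)    # undo the binding
```

Interrupt affinity, by contrast, is typically set outside the process, e.g. via `/proc/irq/*/smp_affinity` on Linux.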

### Štěpán Timr, Jiří Brabec, Alexey Bondar, Tomáš Ryba, Miloš Železný, Josef Lazar, Pavel Jungwirth, "Non-Linear Optical Properties of Fluorescent Dyes Allow for Accurate Determination of Their Molecular Orientations in Phospholipid Membranes", The Journal of Physical Chemistry, July 6, 2015

Several methods based on single- and two-photon fluorescence detected linear dichroism have recently been used to determine the orientational distributions of fluorescent dyes in lipid membranes. However, these determinations relied on simplified descriptions of non-linear anisotropic properties of the dye molecules, using a transition dipole moment-like vector instead of an absorptivity tensor. To investigate the validity of the vector approximation, we have now carried out a combination of computer simulations and polarization microscopy experiments on two representative fluorescent dyes (DiI and F2N12S) embedded in aqueous phosphatidylcholine bilayers. Our results indicate that a simplified vector-like treatment of the two-photon transition tensor is applicable for molecular geometries sampled in the membrane at ambient conditions. Furthermore, our results allow evaluation of several distinct polarization microscopy techniques. In combination, our results point to a robust and accurate experimental and computational treatment of orientational distributions of DiI, F2N12S and related dyes (including Cy3, Cy5, and others), with implications for monitoring physiologically relevant processes in cellular membranes in a novel way.

### Abhinav Sarje, "Computing Nanostructures at Scale", OLCF User Meeting, June 2015

The inverse modeling, or structural fitting, problem of recovering nanostructures from X-ray scattering data obtained through experiments at light-source synchrotrons is an ideal example of a Big Data and Big Compute application. X-ray scattering based extraction of structural information from material samples is an important tool for nanostructure prediction through characterization of macromolecules and nanoparticle systems, with applications such as the design of energy-relevant nano-devices. At Berkeley Lab, we are developing high-performance solutions for the analysis of such raw data. In our work we exploit the massive parallelism available in clusters of GPUs, such as the Titan supercomputer, to gain efficiency in the reconstruction process. We explore the application of various numerical optimization algorithms, ranging from simple gradient-based quasi-Newton methods and derivative-free trust-region-based methods to the stochastic algorithm of Particle Swarm Optimization, in a massively parallel fashion. https://vimeo.com/133558018

### Abhinav Sarje, Sukhyun Song, Douglas Jacobsen, Kevin Huck, Jeffrey Hollingsworth, Allen Malony, Samuel Williams, and Leonid Oliker,"Parallel Performance Optimizations on Unstructured Mesh-Based Simulations",Procedia Computer Science,1877-0509,June 2015,51:2016-2025,doi: 10.1016/j.procs.2015.05.466

This paper addresses two key parallelization challenges in the unstructured mesh-based ocean modeling code MPAS-Ocean, which uses a mesh based on Voronoi tessellations: (1) load imbalance across processes, and (2) unstructured data access patterns that inhibit intra- and inter-node performance. Our work analyzes the load imbalance due to naive partitioning of the mesh, and develops methods to generate mesh partitionings with better load balance and reduced communication. Furthermore, we present methods that minimize both inter- and intra-node data movement and maximize data reuse. Our techniques include predictive ordering of data elements for higher cache efficiency, as well as communication reduction approaches. We present detailed performance data when running on thousands of cores using the Cray XC30 supercomputer and show that our optimization strategies can exceed the original performance by over 2×. Additionally, many of these solutions can be broadly applied to a wide variety of unstructured grid-based computations.
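The load-balancing problem above can be illustrated with a toy heuristic. The sketch below is a minimal greedy longest-processing-time partitioner in Python, assigning weighted mesh cells to processes so that the heaviest remaining cell always goes to the currently lightest process. It is an illustration of the balancing goal only, not the actual partitioning method developed for MPAS-Ocean; the cell weights and the helper name `greedy_partition` are hypothetical.

```python
import heapq

def greedy_partition(weights, k):
    """Assign weighted cells to k processes: place the next-heaviest
    cell on the currently lightest process (LPT heuristic)."""
    parts = [[] for _ in range(k)]
    heap = [(0.0, i) for i in range(k)]  # (current load, part index)
    heapq.heapify(heap)
    for w in sorted(weights, reverse=True):
        load, i = heapq.heappop(heap)
        parts[i].append(w)
        heapq.heappush(heap, (load + w, i))
    return parts

# Example: 8 cells of varying cost spread over 3 processes.
parts = greedy_partition([5, 3, 8, 2, 7, 4, 6, 1], 3)
loads = sorted(sum(p) for p in parts)
```

A naive contiguous split of the same cells can leave one process with far more work; the heuristic keeps the maximum load close to the average (here 12).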

### Abhinav Sarje,Recovering Structural Information about Nanoparticle Systems,Nvidia GPU Technology Conference,March 19, 2015,

The inverse modeling problem of recovering nanostructures from X-ray scattering data obtained through experiments at light-source synchrotrons is an ideal example of a Big Data and Big Compute application. This session will give an introduction and overview to this problem and its solutions as being developed at the Berkeley Lab. X-ray scattering based extraction of structural information from material samples is an important tool applicable to numerous applications such as design of energy-relevant nano-devices. We exploit the use of parallelism available in clusters of GPUs to gain efficiency in the reconstruction process. To develop a solution, we apply Particle Swarm Optimization (PSO) in a massively parallel fashion, and develop high-performance codes and analyze the performance.

### Abhinav Sarje, Xiaoye S. Li, Dinesh Kumar, Alexander Hexemer,"Recovering Nanostructures from X-Ray Scattering Data",Nvidia GPU Technology Conference (GTC),March 2015,

We consider the inverse modeling problem of recovering nanostructures from X-ray scattering data obtained through experiments at synchrotrons. This has been a primary bottleneck problem in such data analysis. X-ray scattering based extraction of structural information from material samples is an important tool for the characterization of macromolecules and nano-particle systems applicable to numerous applications such as design of energy-relevant nano-devices. We exploit massive parallelism available in clusters of graphics processors to gain efficiency in the reconstruction process. To solve this numerical optimization problem, here we show the application of the stochastic algorithms of Particle Swarm Optimization (PSO) in a massively parallel fashion. We develop high-performance codes for various flavors of the PSO class of algorithms and analyze their performance with respect to the application at hand. We also briefly show the use of two other optimization methods as solutions.
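The PSO approach mentioned above can be sketched compactly. The following is a minimal serial particle swarm optimizer in Python, minimizing a simple sphere objective; the inertia and acceleration coefficients (`w`, `c1`, `c2`) are generic textbook values, and nothing here reflects the parallel decomposition or problem setup of the authors' HPC codes.

```python
import random

def pso(f, dim, n_particles=20, iters=200, lo=-5.0, hi=5.0, seed=1):
    """Minimize f over [lo, hi]^dim with a basic particle swarm."""
    rng = random.Random(seed)
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                # per-particle best positions
    pbest_val = [f(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]  # swarm-wide best
    w, c1, c2 = 0.7, 1.5, 1.5                  # inertia, cognitive, social
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            val = f(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val

sphere = lambda x: sum(v * v for v in x)
best, val = pso(sphere, dim=3)
```

In the massively parallel setting described in the talk, the per-particle objective evaluations (the expensive scattering-pattern simulations) are what get distributed across GPUs.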

### E. Vecharynski, C. Yang, J. E. Pask,"A projected preconditioned conjugate gradient algorithm for computing many extreme eigenpairs of a Hermitian matrix",Journal of Computational Physics, Vol. 290, pp. 73–89,2015,

We present an iterative algorithm for computing an invariant subspace associated with the algebraically smallest eigenvalues of a large sparse or structured Hermitian matrix A. We are interested in the case in which the dimension of the invariant subspace is large (e.g., over several hundreds or thousands) even though it may still be small relative to the dimension of A. These problems arise from, for example, density functional theory (DFT) based electronic structure calculations for complex materials. The key feature of our algorithm is that it performs fewer Rayleigh–Ritz calculations compared to existing algorithms such as the locally optimal block preconditioned conjugate gradient or the Davidson algorithm. It is a block algorithm, and hence can take advantage of efficient BLAS3 operations and be implemented with multiple levels of concurrency. We discuss a number of practical issues that must be addressed in order to implement the algorithm efficiently on a high performance computer.

### Wei Hu, Lin Lin and Chao Yang,"Edge reconstruction in armchair phosphorene nanoribbons revealed by discontinuous Galerkin density functional theory",Phys. Chem. Chem. Phys., 2015, Advance Article,February 11, 2015,doi: 10.1039/C5CP00333D

With the help of our recently developed massively parallel DGDFT (Discontinuous Galerkin Density Functional Theory) methodology, we perform large-scale Kohn–Sham density functional theory calculations on phosphorene nanoribbons with armchair edges (ACPNRs) containing a few thousand to ten thousand atoms. The use of DGDFT allows us to systematically achieve a conventional plane wave basis set type of accuracy, but with a much smaller number (about 15) of adaptive local basis (ALB) functions per atom for this system. The relatively small number of degrees of freedom required to represent the Kohn–Sham Hamiltonian, together with the use of the pole expansion and selected inversion (PEXSI) technique that circumvents the need to diagonalize the Hamiltonian, results in a highly efficient and scalable computational scheme for analyzing the electronic structures of ACPNRs as well as their dynamics. The total wall clock time for calculating the electronic structures of large-scale ACPNRs containing 1080–10 800 atoms is only 10–25 s per self-consistent field (SCF) iteration, with accuracy fully comparable to that obtained from conventional plane wave DFT calculations. For the ACPNR system, we observe that the DGDFT methodology can scale to 5000–50 000 processors. We use DGDFT based ab initio molecular dynamics (AIMD) calculations to study the thermodynamic stability of ACPNRs. Our calculations reveal that a 2 × 1 edge reconstruction appears in ACPNRs at room temperature.

### Thorsten Kurth, Andrew Pochinsky, Abhinav Sarje, Sergey Syritsyn, Andre Walker-Loud,"High-Performance I/O: HDF5 for Lattice QCD",arXiv:1501.06992,January 2015,

Practitioners of lattice QCD/QFT have been some of the primary pioneer users of the state-of-the-art high-performance-computing systems, and contribute towards the stress tests of such new machines as soon as they become available. As with all aspects of high-performance-computing, I/O is becoming an increasingly specialized component of these systems. In order to take advantage of the latest available high-performance I/O infrastructure, to ensure reliability and backwards compatibility of data files, and to help unify the data structures used in lattice codes, we have incorporated parallel HDF5 I/O into the SciDAC supported USQCD software stack. Here we present the design and implementation of this I/O framework. Our HDF5 implementation outperforms optimized QIO at the 10-20% level and leaves room for further improvement by utilizing appropriate dataset chunking.

### D. Zuev, E. Vecharynski, C. Yang, N. Orms, and A.I. Krylov,"New algorithms for iterative matrix-free eigensolvers in quantum chemistry",Journal of Computational Chemistry, Vol. 36, Issue 5, pp. 273–284,2015,

New algorithms for iterative diagonalization procedures that solve for a small set of eigen-states of a large matrix are described. The performance of the algorithms is illustrated by calculations of low and high-lying ionized and electronically excited states using equation-of-motion coupled-cluster methods with single and double substitutions (EOM-IP-CCSD and EOM-EE-CCSD). We present two algorithms suitable for calculating excited states that are close to a specified energy shift (interior eigenvalues). One solver is based on the Davidson algorithm, a diagonalization procedure commonly used in quantum-chemical calculations. The second is a recently developed solver, called the “Generalized Preconditioned Locally Harmonic Residual (GPLHR) method.” We also present a modification of the Davidson procedure that allows one to solve for a specific transition. The details of the algorithms, their computational scaling, and memory requirements are described. The new algorithms are implemented within the EOM-CC suite of methods in the Q-Chem electronic structure program.

### David H. Bailey, Stephanie Ger, Marcos Lopez de Prado, Alexander Sim, Kesheng Wu,"Statistical Overfitting and Backtest Performance",Quantitative Finance,2015,

http://ssrn.com/abstract=2507040

### Wei Hu, Lin Lin, Chao Yang and Jinlong Yang,"Electronic structure and aromaticity of large-scale hexagonal graphene nanoflakes",J. Chem. Phys. 141, 214704 (2014),December 2, 2014,141:214704,doi: 10.1063/1.4902806

With the help of the recently developed SIESTA-PEXSI method [L. Lin, A. García, G. Huhs, and C. Yang, J. Phys.: Condens. Matter 26, 305503 (2014)], we perform Kohn-Sham density functional theory calculations to study the stability and electronic structure of hydrogen passivated hexagonal graphene nanoflakes (GNFs) with up to 11 700 atoms. We find the electronic properties of GNFs, including their cohesive energy, edge formation energy, highest occupied molecular orbital-lowest unoccupied molecular orbital energy gap, edge states, and aromaticity, depend sensitively on the type of edges (armchair graphene nanoflakes (ACGNFs) and zigzag graphene nanoflakes (ZZGNFs)), size, and the number of electrons. We observe that, due to the edge-induced strain effect in ACGNFs, large-scale ACGNFs' edge formation energy decreases as their size increases. This trend does not hold for ZZGNFs due to the presence of many edge states in ZZGNFs. We find that the energy gaps E g of GNFs all decay with respect to 1/L, where L is the size of the GNF, in a linear fashion. But as their size increases, ZZGNFs exhibit more localized edge states. We believe the presence of these states makes their gap decrease more rapidly. In particular, when L is larger than 6.40 nm, we find that ZZGNFs exhibit metallic characteristics. Furthermore, we find that the aromatic structures of GNFs appear to depend only on whether the system has 4N or 4N + 2 electrons, where N is an integer.

### J. Choi, A. Sim,Data reduction methods, systems, and devices,U.S. Patent Pending serial no. 14/555,365,2014,

U.S. Patent pending serial no. 14/555,365, “DATA REDUCTION METHODS, SYSTEMS, AND DEVICES”, filed on 11/26/2014. Provisional application no. 61/909,518. “An Efficient Data Reduction Method with Locally Exchangeable Measures”, J. Choi and A. Sim, filed on 11/27/2013, LBNL IB2013-133.

### J.A. Ang, R.F. Barrett, R.E. Benner, D. Burke, C. Chan, D. Donofrio, S.D. Hammond, K.S. Hemmert, S.M. Kelly, H. Le, V.J. Leung, D.R. Resnick, A.F. Rodrigues, J. Shalf, D. Stark, N.J. Wright, D. Unat,"Abstract Machine Models and Proxy Architectures for Exascale Computing",Co-HPC 2014 (to appear),New Orleans, LA, USA,IEEE Computer Society,November 17, 2014,

To achieve Exascale computing, fundamental hardware architectures must change. The most significant consequence of this assertion is the impact on the scientific applications that run on current High Performance Computing (HPC) systems, many of which codify years of scientific domain knowledge and refinements for contemporary computer systems. In order to adapt to Exascale architectures, developers must be able to reason about new hardware and determine what programming models and algorithms will provide the best blend of performance and energy efficiency into the future. While many details of the Exascale architectures are undefined, an abstract machine model is designed to allow application developers to focus on the aspects of the machine that are important or relevant to performance and code structure. These models are intended as communication aids between application developers and hardware architects during the co-design process. We use the term proxy architecture to describe a parameterized version of an abstract machine model, with the parameters added to elucidate potential speeds and capacities of key hardware components. These more detailed architectural models are formulated to enable discussion between the developers of analytic models and simulators and computer hardware architects. They allow for application performance analysis and hardware optimization opportunities. In this report, our goal is to provide the application development community with a set of models that can help software developers prepare for Exascale; through the use of proxy architectures, we enable a more concrete exploration of how well application codes map onto future architectures.

### Abhinav Sarje, Xiaoye S Li, Alexander Hexemer,"Tuning HipGISAXS on Multi and Many Core Supercomputers",High Performance Computing Systems. Performance Modeling, Benchmarking and Simulation,Denver, CO,Springer International Publishing,2014,8551:217-238,doi: 10.1007/978-3-319-10214-6_11

With the continual development of multi and many-core architectures, there is a constant need for architecture-specific tuning of application-codes in order to realize high computational performance and energy efficiency, closer to the theoretical peaks of these architectures. In this paper, we present optimization and tuning of HipGISAXS, a parallel X-ray scattering simulation code [9], on various massively-parallel state-of-the-art supercomputers based on multi and many-core processors. In particular, we target clusters of general-purpose multi-cores such as Intel Sandy Bridge and AMD Magny Cours, and many-core accelerators like Nvidia Kepler GPUs and Intel Xeon Phi coprocessors. We present both high-level algorithmic and low-level architecture-aware optimization and tuning methodologies on these platforms. We cover a detailed performance study of our codes on single and multiple nodes of several current top-ranking supercomputers. Additionally, we implement autotuning of many of the algorithmic and optimization parameters for dynamic selection of their optimal values to ensure high-performance and high-efficiency.

### Abhinav Sarje, Xiaoye S Li, Alexander Hexemer,"High-Performance Inverse Modeling with Reverse Monte Carlo Simulations",43rd International Conference on Parallel Processing,Minneapolis, MN,IEEE,September 2014,201-210,doi: 10.1109/ICPP.2014.29

In the field of nanoparticle material science, X-ray scattering techniques are widely used for characterization of macromolecules and particle systems (ordered, partially-ordered or custom) based on their structural properties at the micro- and nano-scales. Numerous applications utilize these, including design and fabrication of energy-relevant nanodevices such as photovoltaic and energy storage devices. Due to its size, analysis of raw data obtained through present ultra-fast light beamlines and X-ray scattering detectors has been a primary bottleneck in such characterization processes. To address this hurdle, we are developing high-performance parallel algorithms and codes for analysis of X-ray scattering data for several of the scattering methods, such as the Small Angle X-ray Scattering (SAXS), which we talk about in this paper. As an inverse modeling problem, structural fitting of the raw data obtained through SAXS experiments is a method used for extracting meaningful information on the structural properties of materials. Such fitting processes involve a large number of variable parameters and, hence, require a large amount of computational power. In this paper, we focus on this problem and present a high-performance and scalable parallel solution based on the Reverse Monte Carlo simulation algorithm, on highly-parallel systems such as clusters of multicore CPUs and graphics processors. We have implemented and optimized our algorithm on generic multi-core CPUs as well as the Nvidia GPU architectures with C++ and CUDA. We also present detailed performance results and computational analysis of our code.

### Wenqi Xia, Wei Hu, Zhenyu Li and Jinlong Yang,"A first-principles study of gas adsorption on germanene",Phys. Chem. Chem. Phys., 2014,16, 22495-22498,August 29, 2014,doi: 10.1039/C4CP03292F

The adsorption of common gas molecules (N2, CO, CO2, H2O, NH3, NO, NO2, and O2) on germanene is studied with density functional theory. The results show that N2, CO, CO2, and H2O are physisorbed on germanene via van der Waals interactions, while NH3, NO, NO2, and O2 are chemisorbed on germanene via strong covalent (Ge–N or Ge–O) bonds. The chemisorption of gas molecules on germanene opens a band gap at the Dirac point of germanene. NO2 chemisorption on germanene shows strong hole doping in germanene. O2 is easily dissociated on germanene at room temperature. Different adsorption behaviors of common gas molecules on germanene provide a feasible way to exploit chemically modified germanene.

### N. Hanford, V. Ahuja, M. Farrens, D. Ghosal, M. Balman, E. Pouyoul, B. Tierney,"Analysis of the effect of core affinity on high-throughput flows",NDM'14,ACM,2014,doi: 10.1109/NDM.2014.10

Network throughput is scaling-up to higher data rates while end-system processors are scaling-out to multiple cores. In order to optimize high speed data transfer into multicore end-systems, techniques such as network adapter offloads and performance tuning have received a great deal of attention. Furthermore, several methods of multithreading the network receive process have been proposed. However, thus far attention has been focused on how to set the tuning parameters and which offloads to select for higher performance, and little has been done to understand why the settings do (or do not) work. In this paper we build on previous research to track down the source(s) of the end-system bottleneck for high-speed TCP flows. For the purposes of this paper, we consider protocol processing efficiency to be the amount of system resources used (such as CPU and cache) per unit of achieved throughput (in Gbps). The amounts of various system resources consumed are measured using low-level system event counters. Affinitization, or core binding, is the decision about which processor cores on an end system are responsible for interrupt, network, and application processing. We conclude that affinitization has a significant impact on protocol processing efficiency, and that the performance bottleneck of the network receive process changes drastically with three distinct affinitization scenarios.
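Affinitization as discussed above can be experimented with directly from user space. Below is a minimal sketch in Python, assuming a Linux system where `os.sched_setaffinity` is available (with a no-op fallback elsewhere); the helper name `pin_to_core` is hypothetical and not from the paper's tooling, which operates at the interrupt- and driver-level as well.

```python
import os

def pin_to_core(core):
    """Pin the calling process to a single CPU core and return the
    resulting affinity mask; returns an empty set on platforms that
    lack sched_setaffinity (e.g. macOS, Windows)."""
    if hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(0, {core})       # 0 = current process
        return os.sched_getaffinity(0)
    return set()

mask = pin_to_core(0)
```

Pinning only the application is half the story: the affinitization scenarios studied in the paper also depend on which core services the NIC's receive interrupts (configurable on Linux via `/proc/irq/*/smp_affinity`), since cache locality between interrupt, protocol, and application processing is what drives the measured efficiency differences.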

### N. Hanford, V. Ahuja, M. Farrens, D. Ghosal, M. Balman, E. Pouyoul, B. Tierney,"Impact of the end-system and affinities on the throughput of high-speed flows",ANCS '14: Proceedings of the tenth ACM/IEEE symposium on Architectures for networking and communications systems,ACM,2014,doi: 10.1145/2658260.2661772

Network throughput is scaling-up to higher data rates while processors are scaling-out to multiple cores. In order to optimize high speed data transfer into multicore end-systems, network adapter offloads and performance tuning have received a great deal of attention. However, much of this attention is focused on how to set the tuning parameters and which offloads to select for higher performance, and not why they do (or do not) work. In this study we have attempted to address two issues that impact data transfer performance. The first is the impact of processor core affinity (or core binding), which determines which processor core or cores handle certain tasks in a network- or I/O-heavy application running on a multicore end-system. The second is the impact of Ethernet pause frames, which provide link-layer flow control in addition to the end-to-end flow control provided by TCP. The goal of our research is to delve deeper into why these tuning suggestions and this offload exist, and how they affect the end-to-end performance and efficiency of a single, large TCP flow.

### Amir Kamil, Yili Zheng, Katherine Yelick,"A Local-View Array Library for Partitioned Global Address Space C++ Programs",ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming,June 2014,

Multidimensional arrays are an important data structure in many scientific applications. Unfortunately, built-in support for such arrays is inadequate in C++, particularly in the distributed setting where bulk communication operations are required for good performance. In this paper, we present a multidimensional library for partitioned global address space (PGAS) programs, supporting the one-sided remote access and bulk operations of the PGAS model. The library is based on Titanium arrays, which have proven to provide good productivity and performance. These arrays provide a local view of data, where each rank constructs its own portion of a global data structure, matching the local view of execution common to PGAS programs and providing maximum flexibility in structuring global data. Unlike Titanium, which has its own compiler with array-specific analyses, optimizations, and code generation, we implement multidimensional arrays solely through a C++ library. The main goal of this effort is to provide a library-based implementation that can match the productivity and performance of a compiler-based approach. We implement the array library as an extension to UPC++, a C++ library for PGAS programs, and we extend Titanium arrays with specializations to improve performance. We evaluate the array library by porting four Titanium benchmarks to UPC++, demonstrating that it can achieve up to 25% better performance than Titanium without a significant increase in programmer effort.

### David H. Bailey, Jonathan M. Borwein, Marcos Lopez de Prado, Qiji Jim Zhu,"Pseudo-mathematics and financial charlatanism: The effects of backtest overfitting on out-of-sample performance",Notices of the American Mathematical Society,May 1, 2014,458-471,

Recent computational advances allow investment managers to search for profitable investment strategies. In many instances, that search involves a pseudo-mathematical argument, which is spuriously validated through a simulation of its historical performance (also called backtest).

We prove that high performance is easily achievable after backtesting a relatively small number of alternative strategy configurations, a practice we denote “backtest overfitting”. The higher the number of configurations tried, the greater is the probability that the backtest is overfit. Because financial analysts rarely report the number of configurations tried for a given backtest, investors cannot evaluate the degree of overfitting in most investment proposals.

The implication is that investors can be easily misled into allocating capital to strategies that appear to be mathematically sound and empirically supported by an outstanding backtest. This practice is particularly pernicious, because due to the nature of financial time series, backtest overfitting has a detrimental effect on the future strategy’s performance.
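The core effect can be illustrated with a small simulation (a sketch, not the authors' code or data): among N purely random, skill-less strategy configurations, the best in-sample Sharpe ratio grows with N.

```python
import random
import statistics

def sharpe(returns):
    """In-sample Sharpe ratio (annualization omitted for simplicity)."""
    return statistics.mean(returns) / statistics.stdev(returns)

def best_backtest_sharpe(n_configs, n_periods=250, seed=0):
    """Best in-sample Sharpe over n_configs strategies with ZERO true skill:
    each 'strategy' is just i.i.d. zero-mean Gaussian daily returns."""
    rng = random.Random(seed)
    best = float("-inf")
    for _ in range(n_configs):
        rets = [rng.gauss(0.0, 0.01) for _ in range(n_periods)]
        best = max(best, sharpe(rets))
    return best

# The more configurations are tried, the better the best backtest looks,
# even though no strategy has any predictive power:
few, many = best_backtest_sharpe(1), best_backtest_sharpe(500)
```

Since the number of configurations tried is rarely reported, an investor shown only `many` cannot distinguish skill from this selection effect.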

### Adrian Tate, Amir Kamil, Anshu Dubey, Armin Größlinger, Brad Chamberlain, Brice Goglin, Carter Edwards, Chris J. Newburn, David Padua, Didem Unat, Emmanuel Jeannot, Frank Hannig, Gysi Tobias, Hatem Ltaief, James Sexton, Jesus Labarta, John Shalf, Karl Fuerlinger, Kathryn O’Brien, Leonidas Linardakis, Maciej Besta, Marie-Christine Sawley, Mark Abraham, Mauro Bianco, Miquel Pericàs, Naoya Maruyama, Paul Kelly, Peter Messmer, Robert B. Ross, Romain Cledat, Satoshi Matsuoka, Thomas Schulthess, Torsten Hoefler, Vitus Leung,"Programming Abstractions for Data Locality",2014 Workshop on Programming Abstractions for Data Locality,April 29, 2014,

The goal of the workshop and this report is to identify common themes and standardize concepts for locality-preserving abstractions for exascale programming models. Current software tools are built on the premise that computing is the most expensive component, but we are rapidly moving to an era in which computing is cheap and massively parallel while data movement dominates energy and performance costs. In order to respond to exascale systems (the next generation of high performance computing systems), the scientific computing community needs to refactor their applications to align with the emerging data-centric paradigm. Our applications must be evolved to express information about data locality. Unfortunately, current programming environments offer few ways to do so. They ignore the incurred cost of communication and simply rely on hardware cache coherency to virtualize data movement. With the increasing importance of task-level parallelism on future systems, task models have to support constructs that express data locality and affinity. At the system level, communication libraries implicitly assume all the processing elements are equidistant to each other. In order to take advantage of emerging technologies, application developers need a set of programming abstractions to describe data locality for the new computing ecosystem. The new programming paradigm should be more data centric and allow developers to describe how to decompose and lay out data in memory.

Fortunately, there are many emerging concepts for managing data locality, such as constructs for tiling, data layout, array views, task and thread affinity, and topology-aware communication libraries. There is an opportunity to identify commonalities in strategy that enable us to combine the best of these concepts to develop a comprehensive approach to expressing and managing data locality on exascale programming systems. These programming model abstractions can expose crucial information about data locality to the compiler and runtime system to enable performance-portable code. The research question is to identify the right level of abstraction, which includes techniques that range from template libraries all the way to completely new languages to achieve this goal.


### Amir Kamil,Managing Hierarchy with Teams in the SPMD Programming Model,Workshop on Programming Abstractions for Data Locality (PADAL'14),April 28, 2014,

The single program, multiple data (SPMD) model of parallelism is the dominant programming model for large-scale distributed-memory machines. Its simple structure maps well to such machines: it exposes the actual degree of available parallelism, leads to good locality, and can be implemented by efficient runtime systems. However, its simplicity also makes it difficult to manage hierarchy, both at the algorithmic level (e.g. divide-and-conquer algorithms) and in addressing the communication characteristics of hierarchical machines. In this talk, we present a hierarchical team mechanism that allows SPMD programs to manage hierarchy. We show that it allows divide-and-conquer algorithms such as sorting to be expressed in SPMD and that it enables optimizations for hierarchical machines, increasing the scalability and/or performance of multiple benchmarks. We also explore how hierarchical teams may prove useful in other programming abstractions, such as expressing hierarchical distribution of data.
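The team mechanism can be sketched in a few lines (hypothetical Python pseudocode of the concept; Titanium's actual team API differs): a team is an ordered set of ranks that can be split recursively, and each rank can ask for its position within the current team.

```python
class Team:
    """A hypothetical SPMD team: an ordered set of ranks that can be
    recursively split into subteams (not Titanium's actual API)."""

    def __init__(self, ranks):
        self.ranks = list(ranks)

    def size(self):
        return len(self.ranks)

    def my_index(self, rank):
        """This rank's position within the team."""
        return self.ranks.index(rank)

    def split(self, n_subteams, rank):
        """Collectively block-partition the team; each caller gets back
        the subteam containing its own rank."""
        per = self.size() // n_subteams
        block = self.my_index(rank) // per
        return Team(self.ranks[block * per:(block + 1) * per])

# 8 ranks split into two 4-rank teams (e.g. one per NUMA domain),
# then each subteam split again (e.g. one pair per socket):
world = Team(range(8))
numa = world.split(2, rank=5)   # rank 5 lands in the second subteam
pair = numa.split(2, rank=5)
```

Recursive splits of this form let collectives and divide-and-conquer steps operate on a subtree of ranks that mirrors the machine hierarchy.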

### D. Yu, D. Katramatos, A. Sim, A. Shoshani,Co-scheduling of network resource provisioning and host-to-host bandwidth reservation on high-performance network and storage systems,US Patent 8,705,342 B2,2014,

US Patent 8,705,342 B2, issued Apr. 22, 2014. Prior publication No. US 2012/0268053 A1, published Oct. 25, 2012; provisional application No. 61/393,750, filed Oct. 15, 2010. LBNL IB-3152, BNL BSA 11-02.

### E. Vecharynski and Y. Saad,"Fast updating algorithms for latent semantic indexing",SIAM Journal on Matrix Analysis and Applications, Vol. 35, Issue 3, pp. 1105–1131,2014,

This paper discusses a few algorithms for updating the approximate singular value decomposition (SVD) in the context of information retrieval by latent semantic indexing (LSI) methods. A unifying framework is considered which is based on Rayleigh–Ritz projection methods. First, a Rayleigh–Ritz approach for the SVD is discussed and it is then used to interpret the Zha and Simon algorithms [SIAM J. Sci. Comput., 21 (1999), pp. 782–791]. This viewpoint leads to a few alternatives whose goal is to reduce computational cost and storage requirement by projection techniques that utilize subspaces of much smaller dimension. Numerical experiments show that the proposed algorithms yield accuracies comparable to those obtained from standard ones at a much lower computational cost.

### Abhinav Sarje,Towards Real-Time Nanostructure Prediction with GPUs,GPU Technology Conference,March 2014,

In the field of nanoparticle materials science, synchrotron light sources play a crucial role: X-ray scattering techniques are used for nanostructure prediction through characterization of macromolecules and nanoparticle systems based on their structural properties. Applications are widespread, including artificial photosynthesis, solar cell membranes, photovoltaics and energy storage devices, smart windows, high-density data storage media, and drug discovery. Current state-of-the-art high-throughput beamlines at light sources worldwide are capable of generating terabytes of raw scattering data per week, and this rate is continually growing. This has created a big gap between data generation and data analysis. Consequently, beamline scientists and users have been faced with an extremely inefficient utilization of the light sources, and they are expressing a growing need for real-time data analysis tools to bridge this gap.

X-ray scattering comes in many flavors, such as the widely used small angle X-ray scattering (SAXS) and grazing incidence SAXS (GISAXS), which are the case studies in this session. Efforts are underway at Berkeley Lab to bring scattering data analysis up to the speed of data generation through high-performance and parallel computing. Such analysis is generally composed of two steps: 1) forward simulation, and 2) structural fitting, which uses forward simulation as a building block. Forward simulation of X-ray scattering experiments is an embarrassingly parallel computational problem, making it an ideal candidate for implementation on many-core architectures such as graphics processors and massively parallel systems. An example of such a simulation code developed under these efforts is HipGISAXS, a high-performance and massively parallel code capable of harnessing the computational power offered by clusters of GPUs. HipGISAXS is a step towards real-time scattering data analysis, as it has already brought simulation times down from hours and days to the order of milliseconds and seconds through the power of GPUs. The second component, structural fitting, can be described as an inverse modeling and computational optimization problem involving a large number of variable parameters, making it highly compute-intensive. An example of inverse modeling code, also developed at Berkeley Lab, is HipRMC, a GPU-accelerated implementation of Reverse Monte Carlo (RMC), a popular method for extracting information from SAXS data.
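The Reverse Monte Carlo idea behind such fitting can be sketched in a few lines (a generic, greedy toy variant on a 1D binary model; HipRMC itself fits real scattering data on GPUs): propose a random change to the model structure, recompute the fit error, and keep the change only if it does not worsen the fit.

```python
import random

def fit_error(model, target):
    """Squared discrepancy between the simulated and measured 'profiles'."""
    return sum((m - t) ** 2 for m, t in zip(model, target))

def reverse_monte_carlo(target, steps=2000, seed=1):
    """Greedy RMC on a 1D binary 'structure': propose single-site flips,
    keep only moves that do not worsen the fit."""
    rng = random.Random(seed)
    model = [rng.randint(0, 1) for _ in target]
    err = fit_error(model, target)
    for _ in range(steps):
        i = rng.randrange(len(model))
        model[i] ^= 1                       # propose: flip one site
        new_err = fit_error(model, target)
        if new_err <= err:
            err = new_err                   # accept the move
        else:
            model[i] ^= 1                   # reject: undo the flip
    return model, err

target = [1, 0, 1, 1, 0, 0, 1, 0]
model, err = reverse_monte_carlo(target)
```

Each proposal requires re-evaluating the forward model, which is why a fast (here trivial, in practice GPU-accelerated) forward simulation is the building block of the fitting step.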

Although GPUs are able to deliver high computational power through naive implementations, they require intensive architecture-aware code tuning in order to attain performance rates closer to their theoretical peaks. Such optimizations involve mapping computations and data transfers precisely onto the architecture. HipGISAXS and HipRMC include optimizations that enable them to perform significantly better than implementations on other processor architectures.

### Richard L. Martin, Cory M. Simon, Berend Smit, Maciej Haranczyk,"In-silico design of porous polymer networks: high-throughput screening for methane storage materials",Journal of the American Chemical Society,March 10, 2014,

Porous polymer networks (PPNs) are a class of advanced porous materials that combine the advantages of cheap and stable polymers with the high surface areas and tunable chemistry of metal-organic frameworks. They are of particular interest for gas separation or storage applications, for instance as methane adsorbents for a vehicular natural gas tank or other portable applications.

### Richard L. Martin, Maciej Haranczyk,"Construction and Characterization of Structure Models of Crystalline Porous Polymers",Crystal Growth & Design,March 6, 2014,

Metal-organic frameworks (MOFs) and covalent organic frameworks (COFs) are examples of advanced porous polymeric materials that have emerged in recent years. Their crystalline structure and modular synthesis offer unmatched versatility in their design. By exchanging chemical building blocks, one can both explore the unlimited space of possible structural chemistry within an isoreticular (same crystal topology) series, as well as achieve a wide range of alternative topologies.

### Lev Sarkisov, Richard L. Martin, Maciej Haranczyk, Berend Smit,"On the Flexibility of Metal-Organic Frameworks",Journal of the American Chemical Society,January 24, 2014,

Occasional, large amplitude flexibility in metal-organic frameworks (MOFs) is one of the most intriguing recent discoveries in chemistry and material science. Yet, there is at present no theoretical framework that permits the identification of flexible structures in the rapidly expanding universe of MOFs. Here, we propose a simple method to predict whether a MOF is flexible, based on treating it as a system of rigid elements, connected by hinges. This proposition is correct in application to MOFs based on rigid carboxylate linkers.

### Wei Hu, Nan Xia, Xiaojun Wu, Zhenyu Li and Jinlong Yang,"Silicene as a highly sensitive molecule sensor for NH3, NO and NO2",Phys. Chem. Chem. Phys., 2014,16, 6957-6962,January 23, 2014,doi: 10.1039/C3CP55250K

On the basis of first-principles calculations, we demonstrate the potential application of silicene as a highly sensitive molecule sensor for NH3, NO, and NO2 molecules. NH3, NO and NO2 molecules chemically adsorb on silicene via strong chemical bonds. With distinct charge transfer from silicene to molecules, silicene and chemisorbed molecules form charge-transfer complexes. The adsorption energy and charge transfer in NO2-adsorbed silicene are larger than those of NH3- and NO-adsorbed silicene. Depending on the adsorbate types and concentrations, the silicene-based charge-transfer complexes exhibit versatile electronic properties with tunable band gap opening at the Dirac point of silicene. The calculated charge carrier concentrations of NO2-chemisorbed silicene are 3 orders of magnitude larger than the intrinsic charge carrier concentration of graphene at room temperature. The results demonstrate the great potential of silicene for application as a highly sensitive molecule sensor.

### J.A. Sobota, S.-L. Yang, D. Leuenberger, A.F. Kemper, J.G. Analytis, I.R. Fisher, P.S. Kirchmann, T.P. Devereaux, Z.-X. Shen,"Ultrafast electron dynamics in the topological insulator Bi2Se3 studied by time-resolved photoemission spectroscopy",Journal of Electron Spectroscopy and Related Phenomena,January 22, 2014,

We characterize the topological insulator Bi2Se3 using time- and angle-resolved photoemission spectroscopy. By employing two-photon photoemission, a complete picture of the unoccupied electronic structure from the Fermi level up to the vacuum level is obtained. We demonstrate that the unoccupied states host a second Dirac surface state which can be resonantly excited by 1.5 eV photons. We then study the ultrafast relaxation processes following optical excitation. We find that they culminate in a persistent non-equilibrium population of the first Dirac surface state, which is maintained by a meta-stable population of the bulk conduction band. Finally, we perform a temperature-dependent study of the electron–phonon scattering processes in the conduction band, and find the unexpected result that their rates decrease with increasing sample temperature. We develop a model of phonon emission and absorption from a population of electrons, and show that this counter-intuitive trend is the natural consequence of fundamental electron–phonon scattering processes. This analysis serves as an important reminder that the decay rates extracted by time-resolved photoemission are not in general equal to single electron scattering rates, but include contributions from filling and emptying processes from a continuum of states.

### M.A. Sentef, M. Claassen, A.F. Kemper, B. Moritz, T. Oka, J.K. Freericks, T.P. Devereaux,"Theory of pump-probe photoemission in graphene and the generation of light-induced Haldane multilayers",arXiv pre-print,January 20, 2014,

The combination of time-reversal and inversion symmetry protects massless Dirac fermions in graphene and on the surface of topological insulators. In a milestone paper, Haldane envisioned that breaking either or both of these symmetries would open a gap at the Dirac points, allowing one to tune between a trivial insulator and a Chern insulator. While equilibrium band gap engineering has become a major theme since the first synthesis of monolayer graphene, it was only recently proposed that circularly polarized laser light could turn trivial equilibrium bands into topological nonequilibrium bands. Here we observe ultrafast band gap openings and paradoxical gap closings at a critical field strength. Importantly, the gap openings are accompanied by nontrivial changes of the band topology, realizing a photo-induced Haldane multilayer system. We show that pump-probe photoemission spectroscopy can track these transitions in real time via energy gaps exceeding 100 meV. The analogy with Haldane multilayers is revealed by nontrivial pseudospin textures, going from a monolayer p-wave to a bilayer d-wave symmetry at the critical field strength. We thus predict a nonequilibrium realization of a tunable Haldane multilayer model with a Berry curvature that can be tipped optically by small changes in external fields on femtosecond time scales. Since we are focused on the physics of chiral Dirac fermions, these results apply equally to all systems possessing Dirac points, such as surface states of topological insulators.

### E. Vecharynski, Y. Saad, and M. Sosonkina,"Graph partitioning using matrix values for preconditioning symmetric positive definite systems",SIAM Journal on Scientific Computing Vol. 36, Issue 1, pp. A63-A87,2014,

Prior to the parallel solution of a large linear system, it is required to perform a partitioning of its equations/unknowns. Standard partitioning algorithms are designed using the considerations of the efficiency of the parallel matrix-vector multiplication, and typically disregard the information on the coefficients of the matrix. This information, however, may have a significant impact on the quality of the preconditioning procedure used within the chosen iterative scheme. In the present paper, we suggest a spectral partitioning algorithm, which takes into account the information on the matrix coefficients and constructs partitions with respect to the objective of enhancing the quality of the nonoverlapping additive Schwarz (block Jacobi) preconditioning for symmetric positive definite linear systems. For a set of test problems with large variations in magnitudes of matrix coefficients, our numerical experiments demonstrate a noticeable improvement in the convergence of the resulting solution scheme when using the new partitioning approach.

### David H. Bailey, Stephanie Ger, Marcos Lopez de Prado, Alexander Sim, Kesheng Wu,"Statistical Overfitting and Backtest Performance",http://ssrn.com/abstract=2507040,January 1, 2014,

ISBN 978-1-78548-008-9

### Michael Sentef, Alexander F. Kemper, Brian Moritz, James K. Freericks, Zhi-Xun Shen, and Thomas P. Devereaux,"Examining Electron-Boson Coupling Using Time-Resolved Spectroscopy",Phys. Rev. X 3, 041033 (2013),December 26, 2013,

Nonequilibrium pump-probe time-domain spectroscopies can become an important tool to disentangle degrees of freedom whose coupling leads to broad structures in the frequency domain. Here, using the time-resolved solution of a model photoexcited electron-phonon system, we show that the relaxational dynamics are directly governed by the equilibrium self-energy so that the phonon frequency sets a window for “slow” versus “fast” recovery. The overall temporal structure of this relaxation spectroscopy allows for a reliable and quantitative extraction of the electron-phonon coupling strength without requiring an effective temperature model or making strong assumptions about the underlying bare electronic band dispersion.

### Daniel T. Graves, Phillip Colella, David Modiano, Jeffrey Johnson, Bjorn Sjogreen, Xinfeng Gao,"A Cartesian Grid Embedded Boundary Method for the Compressible Navier Stokes Equations",Communications in Applied Mathematics and Computational Science,December 23, 2013,

In this paper, we present an unsplit method for the time-dependent compressible Navier-Stokes equations in two and three dimensions. We use a conservative, second-order Godunov algorithm. We use a Cartesian grid, embedded boundary method to resolve complex boundaries. We solve for viscous and conductive terms with a second-order semi-implicit algorithm. We demonstrate second-order accuracy in solutions of smooth problems in smooth geometries and demonstrate robust behavior for strongly discontinuous initial conditions in complex geometries.

### Cory M. Simon, Jihan Kim, Li-Chiang Lin, Richard L. Martin, Maciej Haranczyk, Berend Smit,"Optimizing nanoporous materials for gas storage",Physical Chemistry Chemical Physics,December 4, 2013,

Natural gas, mostly methane, is an attractive replacement of petroleum fuels for automotive vehicles because of its economic and environmental advantages. The technological obstacle to using methane as a vehicular fuel is its comparatively low volumetric energy density, necessitating densification strategies to yield reasonable driving ranges from a reasonably sized tank.

### N. Plonka, A. F. Kemper, S. Graser, A. P. Kampf, T. P. Devereaux,"Tunneling spectroscopy for probing orbital anisotropy in iron pnictides",Phys. Rev. B 88, 174518 (2013),November 27, 2013,

Using realistic multiorbital tight-binding Hamiltonians and the T-matrix formalism, we explore the effects of a nonmagnetic impurity on the local density of states in Fe-based compounds. We show that scanning tunneling spectroscopy (STS) has very specific anisotropic signatures that track the evolution of orbital splitting (OS) and antiferromagnetic gaps. Both anisotropies exhibit two patterns that split in energy with decreasing temperature, but for OS these two patterns map onto each other under 90 rotation. STS experiments that observe these signatures should expose the underlying magnetic and orbital order as a function of temperature across various phase transitions.

### Nathan Hanford, Vishal Ahuja, Mehmet Balman, Matthew. Farrens, Dipak Ghosal, Eric Pouyoul, Brian Tierney,"Characterizing the Impact of End-System Affinities On the End-to-End Performance of High-Speed Flows",SC13 workshop,ACM,2013,doi: 10.1145/2534695.2534697

Multi-core end-systems use Receive Side Scaling (RSS) to parallelize protocol processing. RSS uses a hash function on the standard flow descriptors and an indirection table to assign incoming packets to receive queues which are pinned to specific cores. This ensures flow affinity in that the interrupt processing of all packets belonging to a specific flow is processed by the same core. A key limitation of standard RSS is that it does not consider the application process that consumes the incoming data in determining the flow affinity. In this paper, we carry out a detailed experimental analysis of the performance impact of the application affinity in a 40 Gbps testbed network with a dual hexa-core end-system. We show, contrary to conventional wisdom, that when the application process and the flow are affinitized to the same core, the performance (measured in terms of end-to-end TCP throughput) is significantly lower than the line rate. Near line rate performance is observed when the flow and the application process are affinitized to different cores belonging to the same socket. Furthermore, affinitizing the application and the flow to cores on different sockets results in significantly lower throughput than the line rate. These results arise due to the memory bottleneck, which is demonstrated using preliminary correlational data on the cache hit rate in the core that services the application process.
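The RSS mechanism can be sketched as a hash over the flow 4-tuple feeding an indirection table (a simplification: real NICs use a Toeplitz hash keyed by the flow descriptors, and CRC32 here is only a stand-in).

```python
import zlib

N_QUEUES = 12  # one receive queue pinned to each core of a dual hexa-core host
# Indirection table: hash bucket -> receive queue (and thus core).
INDIRECTION = [b % N_QUEUES for b in range(128)]

def rss_queue(src_ip, src_port, dst_ip, dst_port):
    """Map a packet's flow 4-tuple to a receive queue."""
    key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}".encode()
    return INDIRECTION[zlib.crc32(key) % len(INDIRECTION)]

# Every packet of one flow hashes to the same queue (flow affinity) ...
q1 = rss_queue("10.0.0.1", 5001, "10.0.0.2", 80)
q2 = rss_queue("10.0.0.1", 5001, "10.0.0.2", 80)
# ... but the mapping never consults where the consuming process runs,
# which is exactly the limitation the paper studies.
```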

### Slim T. Chourou, Abhinav Sarje, Xiaoye Li, Elaine Chan and Alexander Hexemer,"HipGISAXS: a high-performance computing code for simulating grazing-incidence X-ray scattering data",Journal of Applied Crystallography,2013,46:1781-1795,doi: 10.1107/S0021889813025843

We have implemented a flexible Grazing Incidence Small-Angle Scattering (GISAXS) simulation code in the framework of the Distorted Wave Born Approximation (DWBA) that effectively utilizes the parallel processing power provided by graphics processors and multicore processors. This constitutes a handy tool for experimentalists facing a massive flux of data, allowing them to accurately simulate the GISAXS process and analyze the produced data. The software computes the diffraction image for any given superposition of custom shapes or morphologies in a user-defined region of the reciprocal space for all possible grazing incidence angles and sample orientations. This flexibility then allows one to easily tackle a wide range of possible sample structures, such as nanoparticles on top of or embedded in a substrate or a multilayered structure. In cases where the sample displays regions of significant refractive index contrast, an algorithm has been implemented to perform a slicing of the sample and compute the averaged refractive index profile to be used as the reference geometry of the unperturbed system. Preliminary tests show good agreement with experimental data for a variety of commonly encountered nanostructures.

### Maciej Haranczyk, Li-Chiang Lin, Kyuho Lee, Richard L. Martin, Jeffrey B. Neaton, Berend Smit,"Methane storage capabilities of diamond analogues",Physical Chemistry Chemical Physics,October 31, 2013,

Methane can be an alternative fuel for vehicular usage provided that new porous materials are developed for its efficient adsorption-based storage. Herein, we search for materials for this application within the family of diamond analogues. We used density functional theory to investigate structures in which the tetrahedral C atoms of diamond are separated by -CC- or -BN- groups, as well as ones involving substitution of tetrahedral C atoms with Si and Ge atoms.

### George Michelogiannakis, Xiaoye S. Li, David H. Bailey, John Shalf,"Extending Summation Precision for Network Reduction Operations",25th International Symposium on Computer Architecture and High Performance Computing,IEEE Computer Society,October 2013,

Double precision summation is at the core of numerous important algorithms such as Newton-Krylov methods and other operations involving inner products, but the effectiveness of summation is limited by the accumulation of rounding errors, which are an increasing problem with the scaling of modern HPC systems and data sets. To reduce the impact of precision loss, researchers have proposed increased- and arbitrary-precision libraries that provide reproducible error or even bounded error accumulation for large sums, but do not guarantee an exact result. Such libraries can also increase computation time significantly. We propose big integer (BigInt) expansions of double precision variables that enable arbitrarily large summations without error and provide exact and reproducible results. This is feasible with performance comparable to that of double-precision floating point summation, by the inclusion of simple and inexpensive logic into modern NICs to accelerate performance on large-scale systems.
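The precision-loss problem, and the idea of summing exact integer expansions, can be seen with Python's standard library (an illustration of the underlying technique only, not the proposed in-NIC BigInt logic).

```python
import math
from fractions import Fraction

vals = [0.1] * 10

naive = sum(vals)             # left-to-right, rounds after every addition
exact_once = math.fsum(vals)  # exact sum of the inputs, rounded only once

# Error-free fixed-point accumulation, in the spirit of a big-integer
# expansion: scale each double to an exact integer (exact here because the
# binary value of 0.1 has denominator 2**55, which divides 2**60), sum the
# integers exactly, and convert back with a single final rounding.
SCALE = 2 ** 60
total = sum(int(Fraction(v) * SCALE) for v in vals)
fixed = total / SCALE
```

The naive left-to-right sum accumulates a rounding error at every step and misses 1.0, while both single-rounding schemes recover the correctly rounded result; the paper's proposal is to make the reduction itself error-free in hardware so the result is also independent of summation order.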

### Wei Hu, Zhenyu Li and Jinlong Yang,"Structural, electronic, and optical properties of hybrid silicene and graphene nanocomposite",J. Chem. Phys. 139, 154704 (2013),October 16, 2013,doi: 10.1063/1.4824887

Structural, electronic, and optical properties of the hybrid silicene and graphene (S/G) nanocomposite are examined with density functional theory calculations. It turns out that weak van der Waals interactions dominate between silicene and graphene, with their intrinsic electronic properties preserved. Interestingly, interlayer interactions in the hybrid S/G nanocomposite induce tunable p-type and n-type doping of silicene and graphene, respectively, and their doping carrier concentrations can be modulated by the interfacial spacing.

### Wei Hu, Zhenyu Li and Jinlong Yang,"Surface and size effects on the charge state of NV center in nanodiamonds",Computational and Theoretical Chemistry, 2013, 1021, 49-53,October 1, 2013,doi: 10.1016/j.comptc.2013.06.015

Electronic structures and stability of nitrogen–vacancy (NV) centers doped in nanodiamonds (NDs) have been investigated with large-scale density functional theory (DFT) calculations. Spin polarized defect states are not affected by the particle sizes and surface decorations, while the band gap is sensitive to these effects. Induced by the spherical surface electric dipole layer, surface functionalization has a long-ranged impact on the stability of charged NV centers doped in NDs. The NV− center doped in NDs is more favorable for n-type fluorinated diamond, while NV0 is preferred for p-type hydrogenated NDs. Therefore, surface decoration provides a useful way for defect state engineering.

### Amir Kamil, Katherine Yelick,"Hierarchical Computation in the SPMD Programming Model",26th International Workshop on Languages and Compilers for Parallel Computing,September 2013,

Large-scale parallel machines are programmed mainly with the single program, multiple data (SPMD) model of parallelism. While this model has advantages of scalability and simplicity, it does not fit well with divide-and-conquer parallelism or hierarchical machines that mix shared and distributed memory. In this paper, we define the recursive single program, multiple data model (RSPMD) that extends SPMD with a hierarchical team mechanism to support hierarchical algorithms and machines. We implement this model in the Titanium language and describe how to eliminate a class of deadlocks by ensuring alignment of collective operations. We present application case studies evaluating the RSPMD model, showing that it enables divide-and-conquer algorithms such as sorting to be elegantly expressed and that team collective operations increase performance of conjugate gradient by up to a factor of two. The model also facilitates optimizations for hierarchical machines, improving scalability of particle-in-cell by 8x and performance of sorting and a stencil code by up to 40% and 14%, respectively.

### J. A. Sobota, S.-L. Yang, A. F. Kemper, J. J. Lee, F. T. Schmitt, W. Li, R. G. Moore, J. G. Analytis, I. R. Fisher, P. S. Kirchmann, T. P. Devereaux, and Z.-X. Shen,"Direct Optical Coupling to an Unoccupied Dirac Surface State in the Topological Insulator Bi2Se3",Phys. Rev. Lett. 111, 136802 (2013),September 24, 2013,

We characterize the occupied and unoccupied electronic structure of the topological insulator Bi2Se3 by one-photon and two-photon angle-resolved photoemission spectroscopy and slab band structure calculations. We reveal a second, unoccupied Dirac surface state with similar electronic structure and physical origin to the well-known topological surface state. This state is energetically located 1.5 eV above the conduction band, which permits it to be directly excited by the output of a Ti:sapphire laser. This discovery demonstrates the feasibility of direct ultrafast optical coupling to a topologically protected, spin-textured surface state.

### Y. F. Kung, W.-S. Lee, C.-C. Chen, A. F. Kemper, A. P. Sorini, B. Moritz, and T. P. Devereaux,"Time-dependent charge-order and spin-order recovery in striped systems",Phys. Rev. B 88, 125114 (2013),September 24, 2013,

Using time-dependent Ginzburg-Landau theory, we study the role of amplitude and phase fluctuations in the recovery of charge-stripe and spin-stripe phases in response to a pump pulse that melts the orders. For parameters relevant to the case where charge order precedes spin order thermodynamically, amplitude recovery governs the initial time scales, while phase recovery controls behavior at longer times. In addition to these intrinsic effects, there is a longer spin reorientation time scale related to the scattering geometry that dominates the recovery of the spin phase. Coupling between the charge and spin orders locks the amplitude and similarly the phase recovery, reducing the number of distinct time scales. Our results reproduce the major experimental features of pump-probe x-ray diffraction measurements on the striped nickelate La1.75Sr0.25NiO4. They highlight the main idea of this work, which is the use of time-dependent Ginzburg-Landau theory to study systems with multiple coexisting order parameters.

### Richard L Martin, Mahdi Niknam Shahrak, Joseph A Swisher, Cory M Simon, Julian P Sculley, Hong-Cai Zhou, Berend Smit, Maciej Haranczyk,"Modeling Methane Adsorption in Interpenetrating Porous Polymer Networks",The Journal of Physical Chemistry C,September 19, 2013,

Porous polymer networks (PPNs) are a class of porous materials of particular interest in a variety of energy-related applications because of their stability, high surface areas, and gas uptake capacities. Computationally derived structures for five recently synthesized PPN frameworks, PPN-2, -3, -4, -5, and -6, were generated for various topologies, optimized using semiempirical electronic structure methods, and evaluated using classical grand-canonical Monte Carlo simulations.

### Richard L. Martin, Maciej Haranczyk,"Insights into Multi-Objective Design of Metal–Organic Frameworks",Crystal Growth & Design,September 18, 2013,

Metal-organic framework (MOF) crystal topologies which permit the highest internal surface areas are identified by means of multiobjective optimization and abstract structure models. We demonstrate that MOF design efforts can be focused within five underlying nets to engineer distinct, Pareto-optimal compromises between high gravimetric and high volumetric surface area materials.

### Marielle Pinheiro, Richard L. Martin, Chris H. Rycroft, Maciej Haranczyk,"High accuracy geometric analysis of crystalline porous materials",CrystEngComm,September 5, 2013,

A number of algorithms to analyze crystalline porous materials and their porosity employ the Voronoi tessellation, whereby the space in the material is divided into irregular polyhedral cells that can be analyzed to determine the pore topology and structure. However, the Voronoi tessellation is only appropriate when all atoms have equal radii, and the natural generalization to structures with unequal radii leads to cells with curved boundaries, which are computationally expensive to construct.
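The limitation mentioned above can be seen already in one dimension: with plain Euclidean distance, the Voronoi boundary between two atoms sits at their midpoint regardless of atomic size, while an additively weighted distance (distance minus atomic radius) shifts it toward the smaller atom. The NumPy sketch below is purely illustrative and is not the algorithm of the paper:

```python
import numpy as np

# Two "atoms" on a line: one small (r = 0.5) and one large (r = 2.0).
seeds = np.array([[0.0], [4.0]])
radii = np.array([0.5, 2.0])

# Sample points along the segment between the atoms.
pts = np.linspace(0.0, 4.0, 81).reshape(-1, 1)
d = np.abs(pts - seeds.T)                # distance from each point to each seed

plain = np.argmin(d, axis=1)             # ordinary Voronoi assignment
weighted = np.argmin(d - radii, axis=1)  # additively weighted assignment

# First grid point assigned to the large atom under each rule:
# plain boundary is at the midpoint x = 2.0; the weighted boundary
# solves x - 0.5 = (4 - x) - 2.0, i.e. x = 1.25, closer to the small atom.
boundary_plain = pts[np.searchsorted(plain, 1)][0]
boundary_weighted = pts[np.searchsorted(weighted, 1)][0]
```

The weighted rule reproduces the qualitative effect the paper addresses: larger atoms claim more of the void space, so equal-radius cells misrepresent the pore geometry.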

### B. Moritz, A. F. Kemper, M. Sentef, T. P. Devereaux, J. K. Freericks,"Electron-Mediated Relaxation Following Ultrafast Pumping of Strongly Correlated Materials: Model Evidence of a Correlation-Tuned Crossover between Thermal and Nonthermal States",Phys. Rev. Lett. 111, 077401 (2013),2013,

We examine electron-electron mediated relaxation following ultrafast electric field pump excitation of the fermionic degrees of freedom in the Falicov-Kimball model for correlated electrons. The results reveal a dichotomy in the temporal evolution of the system as one tunes through the Mott metal-to-insulator transition: in the metallic regime relaxation can be characterized by evolution toward a steady state well described by Fermi-Dirac statistics with an increased effective temperature; however, in the insulating regime this quasithermal paradigm breaks down with relaxation toward a nonthermal state with a complicated electronic distribution as a function of momentum. We characterize the behavior by studying changes in the energy, photoemission response, and electronic distribution as functions of time. This relaxation may be observable qualitatively on short enough time scales that the electrons behave like an isolated system not in contact with additional degrees of freedom which would act as a thermal bath, especially when using strong driving fields and studying materials whose physics may manifest the effects of correlations.

### Marielle Pinheiro, Richard L. Martin, Chris H. Rycroft, Andrew Jones, Enrique Iglesia, Maciej Haranczyk,"Characterization and comparison of pore landscapes in crystalline porous materials",Journal of Molecular Graphics and Modelling,July 31, 2013,

Crystalline porous materials have many applications, including catalysis and separations. Identifying suitable materials for a given application can be achieved by screening material databases. Such a screening requires automated high-throughput analysis tools that characterize and represent pore landscapes with descriptors, which can be compared using similarity measures in order to select, group and classify materials. Here, we discuss algorithms for the calculation of two types of pore landscape descriptors.

### Wei Hu, Xiaojun Wu, Zhenyu Li and Jinlong Yang,"Helium separation via porous silicene based ultimate membrane",Nanoscale, 2013, 5, 9062-9066,July 11, 2013,doi: 10.1039/C3NR02326E

Helium purification has become more important with increasing demands in scientific and industrial applications. In this work, we demonstrated that porous silicene can be used as an effective ultimate membrane for helium purification on the basis of first-principles calculations. A pristine silicene monolayer is impermeable to helium gas, with a high penetration energy barrier (1.66 eV). However, porous silicene with either a Stone–Wales (SW) or a divacancy (555777 or 585) defect presents a surmountable barrier for helium (0.33 to 0.78 eV) but a formidable one for Ne, Ar, and other gas molecules. In particular, porous silicene with divacancy defects shows high selectivity for He/Ne and He/Ar, superior to graphene, polyphenylene, and traditional membranes.

### A.F. Kemper, M. Sentef, B. Moritz, C.C. Kao, Z.X. Shen, J.K. Freericks, T.P. Devereaux,"Mapping of the unoccupied states and relevant bosonic modes via the time dependent momentum distribution",Phys. Rev. B 87, 235139 (2013),June 28, 2013,

The unoccupied states of complex materials are difficult to measure, yet they play a key role in determining their properties. We propose a technique that can measure the unoccupied states, called time-resolved Compton scattering, which measures the time-dependent momentum distribution (TDMD). Using a nonequilibrium Keldysh formalism, we study the TDMD for electrons coupled to a lattice in a pump-probe setup. We find a direct relation between temporal oscillations in the TDMD and the dispersion of the underlying unoccupied states, suggesting that both can be measured by time-resolved Compton scattering. We demonstrate the experimental feasibility by applying the method to a model of MgB2 with realistic material parameters.

### Y. S. Lee, S. J. Moon, Scott C. Riggs, M. C. Shapiro, I. R. Fisher, Bradford W. Fulfer, Julia Y. Chan, A. F. Kemper, and D. N. Basov,"Infrared study of the electronic structure of the metallic pyrochlore iridate Bi2Ir2O7",Phys. Rev. B 87, 195143 (2013),May 30, 2013,

We investigated the electronic properties of a single crystal of the metallic pyrochlore iridate Bi2Ir2O7 by means of infrared spectroscopy. Our optical conductivity data show the splitting of the t2g bands into Jeff bands due to strong spin-orbit coupling. We observed a sizable midinfrared absorption near 0.2 eV which can be attributed to the optical transition within the Jeff = 1/2 bands. More interestingly, we found an abrupt suppression of optical conductivity in the very far-infrared region. Our results suggest that the electronic structure of Bi2Ir2O7 is governed by strong spin-orbit coupling and correlation effects, which are a prerequisite for theoretically proposed nontrivial topological phases in pyrochlore iridates.

### Richard L. Martin, Maciej Haranczyk,"Optimization-Based Design of Metal-Organic Framework Materials",Journal of Chemical Theory and Computation,May 16, 2013,

Metal–organic frameworks (MOFs) are a class of porous materials constructed from metal or metal oxide building blocks connected by organic linkers. MOFs are highly tunable structures that can in theory be custom designed to meet the specific pore geometry and chemistry required for a given application such as methane storage or carbon capture. However, due to the sheer number of potential materials, identification of optimal MOF structures is a significant challenge.

### Richard L. Martin, Li-Chiang Lin, Kuldeep Jariwala, Berend Smit, Maciej Haranczyk,"Mail-Order Metal–Organic Frameworks (MOFs): Designing Isoreticular MOF-5 Analogues Comprising Commercially Available Organic Molecules",The Journal of Physical Chemistry C,April 17, 2013,

Metal–organic frameworks (MOFs), a class of porous materials, are of particular interest in gas storage and separation applications due largely to their high internal surface areas and tunable structures. MOF-5 is perhaps the archetypal MOF; in particular, many isoreticular analogues of MOF-5 have been synthesized, comprising alternative dicarboxylic acid ligands. In this contribution we introduce a new set of hypothesized MOF-5 analogues, constructed from commercially available organic molecules.

### Nils E. R. Zimmermann, Timm J. Zabel, Frerich J. Keil,"Transport into Nanosheets: Diffusion Equations Put to Test",J. Phys. Chem. C,2013,117:7384-7390,doi: 10.1021/jp400152q

Ultrathin porous materials, such as zeolite nanosheets, are prominent candidates for performing catalysis, drug supply, and separation processes in a highly efficient manner due to exceptionally short transport paths. Predictive design of such processes requires the application of diffusion equations that were derived for macroscopic, homogeneous surroundings to nanoscale, nanostructured host systems. Therefore, we tested different analytical solutions of Fick’s diffusion equations for their applicability to methane transport into two different zeolite nanosheets (AFI, LTA) under non-stationary conditions. Transient molecular dynamics simulations provided the concentration profiles and uptake curves to which the different solutions were fitted. Two central conclusions were deduced by comparing the fitted transport coefficients. First, the transport can be described correctly only if concentration profiles are used and the transport through the solid–gas interface is explicitly accounted for by the surface permeability. Second and most importantly, we have unraveled a size limitation to applying the diffusion equations to nanoscale objects: the transport-diffusion coefficients, DT, and surface permeabilities, α, of methane in AFI become dependent on nanosheet thickness. Deviations can amount to factors of 2.9 and 1.4 for DT and α, respectively, when, in the worst case, results from the thinnest AFI nanosheet are compared with data from the thickest sheet. We present a molecular explanation of the size limitation that is based on memory effects of entering molecules and is therefore only observable for smooth pores such as AFI and carbon nanotubes. Hence, our work provides important tools to accurately predict and intuitively understand transport of guest molecules into porous host structures, a capability that becomes the more valuable the smaller nanotechnological objects get.

Watch a movie illustrating the transient molecular dynamics approach, which was critical for this study, here.
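One classical analytical solution of the kind fitted in such studies is the fractional-uptake series for a plane sheet of half-thickness l with constant surface concentration, M_t/M_∞ = 1 − Σ_n 8/((2n+1)²π²) exp(−D(2n+1)²π²t/(4l²)). A minimal NumPy evaluation, with D, l, and the series truncation chosen purely for illustration:

```python
import numpy as np

def sheet_uptake(t, D, l, nterms=200):
    """Fractional uptake M_t/M_inf for a plane sheet of half-thickness l,
    constant surface concentration (classical Fickian series solution)."""
    n = np.arange(nterms)
    coef = 8.0 / ((2 * n + 1) ** 2 * np.pi ** 2)
    rates = (2 * n + 1) ** 2 * np.pi ** 2 * D / (4 * l ** 2)
    t = np.atleast_1d(np.asarray(t, dtype=float))
    # Sum the exponential series term by term for each requested time.
    return 1.0 - (coef * np.exp(-np.outer(t, rates))).sum(axis=1)

# Uptake rises monotonically from 0 toward 1 as the sheet equilibrates.
times = np.linspace(0.0, 5.0, 6)
u = sheet_uptake(times, D=1.0, l=1.0)
```

Fitting such a curve to simulated uptake data yields the transport-diffusion coefficient DT; the paper's point is that this fit alone misses the surface permeability term.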

### Wei Hu, Zhenyu Li and Jinlong Yang,"Electronic and optical properties of graphene and graphitic ZnO nanocomposite structures",J. Chem. Phys. 138, 124706 (2013),March 28, 2013,doi: 10.1063/1.4796602

Electronic and optical properties of graphene and graphitic ZnO (G/g-ZnO) nanocomposites have been investigated with density functional theory. Graphene interacts overall weakly with g-ZnO monolayer via van der Waals interaction. There is no charge transfer between the graphene and g-ZnO monolayer, while a charge redistribution does happen within the graphene layer itself, forming well-defined electron-hole puddles. When Al or Li is doped in the g-ZnO monolayer, substantial electron (n-type) and hole (p-type) doping can be induced in graphene, leading to well-separated electron-hole pairs at their interfaces. Improved optical properties in graphene/g-ZnO nanocomposite systems are also observed, with potential photocatalytic and photovoltaic applications.

### Mehmet Balman,"Advance Resource Provisioning in Bulk Data Scheduling",27th IEEE International Conference on Advanced Information Networking and Applications (AINA),2013,LBNL 6364E, doi: http://dx.doi.org/10.1109/AINA.2013.5

Today's scientific and business applications generate massive data sets that need to be transferred to remote sites for sharing, processing, and long-term storage. Because of increasing data volumes and enhancements in current network technology that provide on-demand high-speed data access between collaborating institutions, data handling and scheduling problems have reached a new scale. In this paper, we present a new data scheduling model with advance resource provisioning, in which data movement operations are defined with earliest start and latest completion times. We analyze the time-dependent resource assignment problem and propose a new methodology to improve current systems by allowing researchers and higher-level meta-schedulers to use data placement as a service, so they can plan ahead and submit transfer requests in advance. In general, scheduling with time and resource conflicts is NP-hard. We introduce an efficient algorithm to organize multiple requests on the fly while satisfying users' time and resource constraints. We successfully tested our algorithm in a simple benchmark simulator that we have developed, and demonstrated its performance with initial test results.

Keywords: scheduling with constraints, bulk data movement, time-dependent graphs, network reservation, Gale-Shapley algorithm
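To make the request model concrete, a toy discrete-time feasibility check for transfers with earliest-start/latest-completion windows on a shared link can use earliest-deadline-first allocation. The tuple format and unit-slot capacity model below are illustrative assumptions, not the algorithm of the paper:

```python
def edf_feasible(requests, capacity, horizon):
    """Toy feasibility check: each request is (earliest_start, latest_completion, volume).
    In each unit time slot, link capacity is handed to released, unfinished
    requests in earliest-deadline-first order."""
    remaining = {i: vol for i, (_, _, vol) in enumerate(requests)}
    for t in range(horizon):
        cap = capacity
        # Requests released by t, not yet past deadline, still unfinished;
        # most urgent deadline first.
        active = sorted(
            (i for i, (es, lc, _) in enumerate(requests)
             if es <= t < lc and remaining[i] > 0),
            key=lambda i: requests[i][1])
        for i in active:
            if cap == 0:
                break
            alloc = min(cap, remaining[i])
            remaining[i] -= alloc
            cap -= alloc
    return all(v == 0 for v in remaining.values())

# Two transfers sharing a link with capacity 10 per slot: feasible.
ok = edf_feasible([(0, 3, 20), (1, 4, 10)], capacity=10, horizon=4)
# A 25-unit transfer in a 2-slot window: infeasible.
bad = edf_feasible([(0, 2, 25)], capacity=10, horizon=4)
```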

### Abhinav Sarje,"Synchrotron Light-source Data Analysis through Massively-parallel GPU Computing",NVIDIA GPU Technology Conference (GTC),March 2013,

Light scattering techniques are widely used for the characterization of macromolecules and particle systems (ordered, partially-ordered or custom) based on their properties, such as their structures and sizes, at the micro/nano-scales. One of the major applications of these is in the characterization of materials for the design and fabrication of energy-relevant nanodevices, such as photovoltaic and energy storage devices. Although current high-throughput synchrotron light-sources can generate tremendous amounts of raw data at a high rate, analysis of this data for the characterization processes remains the primary bottleneck, demanding large amounts of computational resources. In this work, we are developing high-performance parallel algorithms and codes on large-scale GPU clusters to address this bottleneck. Here, we will discuss our efforts and experiences in developing "architecture-aware" hybrid multi-GPU multi-CPU codes for two of the most important analysis steps. First is simulation of X-ray scattering patterns for any given sample morphology, and second is structural fitting of such scattering patterns. Both steps involve a large number of variable parameters and, hence, require high computational power.
Our X-ray scattering pattern simulation code is based on the Distorted Wave Born Approximation (DWBA) theory, and involves a large number of compute-intensive form-factor calculations. A form-factor is computed as an integral over the shape functions of the nanoparticles in a sample. A simulated sample structure is taken as an input in the form of discretized shape-surfaces, such as a triangulated surface. The resolution of such discretization, as well as of a spatial 3-D grid involved, also contributes to the computational cost of the simulations. Our code uses hybrid GPU and multicore CPU acceleration to generate high-resolution scattering patterns, for the given input, using various possible values of the input parameters. These parameters include sample-definition, experimental-setup, and computational parameters. The patterns obtained through the X-ray scattering simulations carry vital information about the structural properties of the materials in the sample. In order to extract meaningful structural information from the scattering patterns, structural fitting, as an inverse modeling problem, is used. Our codes implement a fast and scalable solution to this process through a Reverse Monte Carlo (RMC) simulation algorithm. This process computes structure factors in each simulation step until a result fits the input image pattern within an allowed error range. These computations require a large number of fast Fourier transform (FFT) calculations, which are also accelerated on hybrid GPU and CPU systems in our codes for high performance.
Our codes are designed as GPU-architecture-aware implementations, and deliver high performance through dynamic selection of the best-performing computational parameters, such as the computation decomposition parameters and block sizes, for the system being used. We perform a detailed study of the effects of these parameters on code performance, and use this information to guide the parameter value selection process. We also carry out performance analysis of the optimized codes and study their scaling, including scaling to large GPU clusters. Our codes obtain near linear scaling with respect to cluster size in our experiments, and we believe that they are "future-ready".

### E. Vecharynski and A. Knyazev,"Absolute value preconditioning for symmetric indefinite linear systems",SIAM Journal on Scientific Computing Vol. 35, Issue 2, pp. A696-A718,2013,

We introduce a novel strategy for constructing symmetric positive definite (SPD) preconditioners for linear systems with symmetric indefinite matrices. The strategy, called absolute value preconditioning, is motivated by the observation that the preconditioned minimal residual method with the inverse of the absolute value of the matrix as a preconditioner converges to the exact solution of the system in at most two steps. Neither the exact absolute value of the matrix nor its exact inverse are computationally feasible to construct in general. However, we provide a practical example of an SPD preconditioner that is based on the suggested approach. In this example we consider a model problem with a shifted discrete negative Laplacian and suggest a geometric multigrid (MG) preconditioner, where the inverse of the matrix absolute value appears only on the coarse grid, while operations on finer grids are based on the Laplacian. Our numerical tests demonstrate practical effectiveness of the new MG preconditioner, which leads to a robust iterative scheme with minimalist memory requirements.
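The two-step convergence observation is easy to verify numerically: for a symmetric indefinite A = QΛQᵀ, the matrix absolute value is |A| = Q|Λ|Qᵀ, and the preconditioned operator |A|⁻¹A has only the eigenvalues ±1, so preconditioned MINRES needs at most two steps. A small NumPy check on a random test matrix (not the multigrid construction of the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
# Random symmetric (generically indefinite) matrix.
B = rng.standard_normal((n, n))
A = (B + B.T) / 2

# Matrix absolute value via the eigendecomposition: |A| = Q |Lambda| Q^T.
lam, Q = np.linalg.eigh(A)
abs_A = Q @ np.diag(np.abs(lam)) @ Q.T

# The preconditioned operator |A|^{-1} A has spectrum contained in {-1, +1}.
spec = np.linalg.eigvals(np.linalg.solve(abs_A, A))
print(np.sort(spec.real))   # entries cluster at -1 and +1
```

In practice neither |A| nor its inverse is formed exactly; the paper's contribution is an SPD preconditioner (a multigrid one in its model problem) that approximates this ideal operator cheaply.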

### Wei Hu, Xiaojun Wu, Zhenyu Li and Jinlong Yang,"Porous silicene as a hydrogen purification membrane",Phys. Chem. Chem. Phys., 2013, 15, 5753-5757,February 22, 2013,doi: 10.1039/C3CP00066D

We investigated theoretically the hydrogen permeability and selectivity of a porous silicene membrane via first-principles calculations. The subnanometer pores of the silicene membrane are designed as divacancy defects with octagonal and pentagonal rings (585-divacancy). The porous silicene exhibits high selectivity, comparable with graphene-based membranes, for hydrogen over various gas molecules (N2, CO, CO2, CH4, and H2O). The divacancy defects in silicene are chemically inert to the considered gas molecules. Our results suggest that porous silicene membranes hold great potential for gas separation and filtering applications.

### Abhinav Sarje, Srinivas Aluru,"All-pairs computations on many-core graphics processors",Parallel Computing,2013,39-2:79-93,doi: 10.1016/j.parco.2013.01.002

Developing high-performance applications on emerging multi- and many-core architectures requires efficient mapping techniques and architecture-specific tuning methodologies to realize performance closer to their peak compute capability and memory bandwidth. In this paper, we develop architecture-aware methods to accelerate all-pairs computations on many-core graphics processors. Pairwise computations occur frequently in numerous application areas in scientific computing. While they appear easy to parallelize due to the independence of computing each pairwise interaction from all others, developing techniques to address multi-layered memory hierarchies, mapping within the restrictions imposed by the small and low-latency on-chip memories, and striking the right balance between concurrency, reuse, and memory traffic are crucial to obtain high performance. We present a hierarchical decomposition scheme for GPUs based on decomposition of the output matrix and input data. We demonstrate that careful tuning of the involved set of decomposition parameters is essential to achieve high efficiency on GPUs. We also compare the performance of our strategies with an implementation on the STI Cell processor as well as multi-core CPU parallelizations using OpenMP and Intel Threading Building Blocks.
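The blocked decomposition of the output matrix can be sketched in a few lines of NumPy: the all-pairs result is computed tile by tile so that each input chunk is reused across a whole block of outputs, mirroring (in a simplified, CPU-only way) the shared-memory tiling described above. The tile size here is an illustrative stand-in for the tuned decomposition parameters:

```python
import numpy as np

def all_pairs_tiled(x, y, tile=64):
    """All-pairs squared Euclidean distances, computed tile by tile.
    Tiling mirrors the blocked decomposition used to fit input chunks
    into fast on-chip memory; here it simply bounds temporary sizes."""
    out = np.empty((x.shape[0], y.shape[0]))
    for i in range(0, x.shape[0], tile):
        xi = x[i:i + tile]
        for j in range(0, y.shape[0], tile):
            yj = y[j:j + tile]
            # One output block: every pair between the two input tiles.
            diff = xi[:, None, :] - yj[None, :, :]
            out[i:i + tile, j:j + tile] = (diff ** 2).sum(-1)
    return out

rng = np.random.default_rng(1)
a = rng.standard_normal((100, 3))
# Untiled reference for comparison.
ref = ((a[:, None, :] - a[None, :, :]) ** 2).sum(-1)
```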


### Richard L. Martin, Maciej Haranczyk,"Exploring frontiers of high surface area metal-organic frameworks",Chemical Science,February 6, 2013,4:1781-1785,

Metal–organic frameworks (MOFs) have enjoyed considerable interest due to their high internal surface areas as well as tunable pore geometry and chemistry. However, design of optimal MOFs is a great challenge due to the significant number of possible structures. In this work, we present a strategy to rapidly explore the frontiers of these high surface area materials. Here, organic ligands are abstracted by geometrical (alchemical) building blocks, and an optimization of their defining geometrical parameters is performed to identify shapes of ligands which maximize gravimetric surface area of the resulting MOFs. A strength of our approach is that the space of ligands to be explored can be rigorously bounded, allowing discovery of the optimum ligand shape within any criteria, conforming to synthetic requirements or arbitrary exploratory limits. By modifying these bounds, we can project to what extent achievable surface area increases when moving beyond the present limits of organic synthesis. Projecting optimal ligand shapes onto real chemical species, we achieve blueprints for MOFs of various topologies that are predicted to achieve up to 70% higher surface area than the current benchmark materials.

### Wei Hu, Zhenyu Li and Jinlong Yang,"Diamond as an inert substrate of graphene",J. Chem. Phys. 138, 054701 (2013),February 1, 2013,doi: 10.1063/1.4789420

The interaction between graphene and a semiconducting diamond substrate has been examined with large-scale density functional theory calculations. Clean and hydrogenated diamond (100) and (111) surfaces have been studied. It turns out that weak van der Waals interactions dominate for graphene on all these surfaces. The high carrier mobility of graphene is almost unaffected, except for a negligible energy gap opening at the Dirac point. No charge transfer between graphene and diamond (100) surfaces is detected, while different charge-transfer complexes are formed between graphene and diamond (111) surfaces, inducing either p-type or n-type doping of graphene. Therefore, diamond can be used as an excellent substrate for graphene, one that largely preserves its electronic structure while providing the flexibility of charge doping.

### Kumari Gaurav Rana, Takeaki Yajima, Subir Parui, Alexander F. Kemper, Thomas P.Devereaux, Yasuyuki Hikita, Harold Y. Hwang, Tamalika Banerjee,"Hot electron transport in a strongly correlated transition-metal oxide",Nature Scientific Reports, Volume 3, id. 1274 (2013).,February 2013,

Oxide heterointerfaces are ideal for investigating strong correlation effects on electron transport, relevant for oxide electronics. Using hot electrons, we probe electron transport perpendicular to the La0.7Sr0.3MnO3 (LSMO)/Nb-doped SrTiO3 (Nb:STO) interface and find the characteristic hot-electron attenuation length in LSMO to be 1.48 +/- 0.10 unit cells (u.c.) at -1.9 V, increasing to 2.02 +/- 0.16 u.c. at -1.3 V at room temperature. Theoretical analysis of this energy dispersion reveals the dominance of electron-electron and polaron scattering. Direct visualization of the local electron transport shows different transmission at the terraces and at the step edges.

### M. Dandouna, N. Emad and L.A. Drummond,"A Proposed Programming Model for Writing Sustainable Numerical Libraries for Extreme Scale Computing",Conc. and Compt.,January 16, 2013,

The promise of computer systems with very large numbers of processing elements cannot be realized without an effective solution that targets the programming model with a suitable programming environment. Nowadays, it is necessary to identify and rapidly make available robust software technologies to enable high-end computer applications to run efficiently on these emerging systems, and to enable the development of more complex and capable simulation codes for scientific and engineering applications. We review some numerical libraries that have achieved modularity, scalability, and extensibility thanks to their use of object-oriented programming approaches. However, only a few of these libraries have managed to effectively implement sequential and parallel code reusability.

Here, we discuss what is currently missing from existing library implementations and propose a programming model based on a modular and multi-level parallelism approach that has a strict separation between computational operations, data management, and communication. We illustrate how this model makes it possible to design more scalable libraries by better exploiting their functionalities, and even enables the formulation of hybrid numerical schemes that run efficiently on multi-level parallel systems with a large number of heterogeneous processing units, without confining the parallelism to the programming model of the communication library. We use the multiple explicitly restarted Arnoldi method as our test case; our implementations require full reuse of serial/parallel kernels. Our experiments include comparisons with state-of-the-art numerical libraries on high-end computing systems.

### E. O. Ofek, D. Fox, S. B. Cenko, M. Sullivan, O., D. A. Frail, A. Horesh, A. Corsi, R. M., N. Gehrels, S. R. Kulkarni, A., P. E. Nugent, O. Yaron, A. V. Filippenko, M. M., L. Bildsten, J. S. Bloom, D., I. Arcavi, R. R. Laher, D. Levitan, B. Sesar, J. Surace,"X-Ray Emission from Supernovae in Dense Circumstellar Matter Environments: A Search for Collisionless Shocks",Astrophysical Journal,2013,763:42,doi: 10.1088/0004-637X/763/1/42

The optical light curve of some supernovae (SNe) may be powered by the outward diffusion of the energy deposited by the explosion shock (the so-called shock breakout) in optically thick (…)

### Michael F. Wehner,"Very extreme seasonal precipitation in the NARCCAP ensemble: model performance and projections",Climate Dynamics,January 2013,40:59-80,doi: 10.1007/s00382-012-1393-1

Seasonal extreme daily precipitation is analyzed in the ensemble of NARCCAP regional climate models. Significant variation in these models’ abilities to reproduce observed precipitation extremes over the contiguous United States is found. Model performance metrics are introduced to characterize overall biases, seasonality, spatial extent, and the shape of the precipitation distribution. Comparison of the models to gridded observations that include an elevation correction is found to be better than to gridded observations without this correction. A complicated model weighting scheme based on model performance in simulating observations is found to yield significant improvements in ensemble mean skill only if some of the models are poorly performing outliers. The effects of lateral boundary conditions are explored by comparing the integrations driven by reanalysis to those driven by global climate models. Projected mid-century changes in seasonal precipitation means and extremes are presented, along with discussions of the sources of uncertainty and the mechanisms causing these changes.

### George Michelogiannakis, William J. Dally,"Elastic Buffer Flow Control for On-Chip Networks",Transactions on Computers,2013,

Networks-on-chip (NoCs) were developed to meet the communication requirements of large-scale systems. The majority of current NoCs spend considerable area and power for router buffers. In our past work, we have developed elastic buffer (EB) flow control which adds simple control logic in the channels to use pipeline flip-flops (FFs) as EBs with two storage locations. This way, channels act as distributed FIFOs and input buffers are no longer required. Removing buffers and virtual channels (VCs) significantly simplifies router design. Compared to VC networks, EB networks provide up to a 45% shorter cycle time, 16% more throughput per unit power or 22% more throughput per unit area. EB networks provide traffic classes using duplicate physical subnetworks. However, this approach negates the cost gains or becomes infeasible for a large number of traffic classes. Therefore, in this paper we propose a hybrid EB-VC router which provides an arbitrary number of traffic classes by using an input buffer to drain flits facing severe contention or deadlock. Thus, hybrid routers operate as EB routers in the common case, and as VC routers when necessary. For this reason, the hybrid EB-VC scheme offers 21% more throughput per unit power than VC networks and 12% more than EB networks.

### Kesheng Wu, Wes Bethel, Ming Gu, David, Oliver Rübel,"A Big Data Approach to Analyzing Market Volatility",Algorithmic Finance,2013,2:241–267,LBNL-6382E, doi: 10.3233/AF-13030

Understanding the microstructure of the financial market requires the processing of a vast amount of data related to individual trades, and sometimes even multiple levels of quotes. Analyzing such a large volume of data requires tremendous computing power that is not easily available to financial academics and regulators. Fortunately, publicly funded High Performance Computing (HPC) power is widely available at the National Laboratories in the US. In this paper we demonstrate that the HPC resource and the techniques for data-intensive sciences can be used to greatly accelerate the computation of an early warning indicator called Volume-synchronized Probability of Informed trading (VPIN). The test data used in this study contains five and a half years' worth of trading data for about 100 of the most liquid futures contracts, comprising about 3 billion trades and occupying 140GB as text files. By using (1) a more efficient file format for storing the trading records, (2) more effective data structures and algorithms, and (3) parallelizing the computations, we are able to explore 16,000 different ways of computing VPIN in less than 20 hours on a 32-core IBM DataPlex machine. Our test demonstrates that a modest computer is sufficient to monitor a vast number of trading activities in real-time -- an ability that could be valuable to regulators.

Our test results also confirm that VPIN is a strong predictor of liquidity-induced volatility. With appropriate parameter choices, the false positive rates are about 7% averaged over all the futures contracts in the test data set. More specifically, when VPIN values rise above a threshold (CDF > 0.99), the volatility in the subsequent time windows is higher than the average in 93% of the cases.
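The abstract describes VPIN only at a high level. The following is a minimal illustrative sketch of the standard VPIN recipe (equal-volume buckets, bulk volume classification, rolling average of order-flow imbalance); the bucket size, window length, and the unit-variance assumption on price changes are illustrative choices, not the paper's actual parameters or implementation.

```python
from collections import deque
from statistics import NormalDist

def vpin(prices, volumes, bucket_volume, window=50):
    """Sketch of Volume-synchronized Probability of Informed Trading.

    Trades are grouped into equal-volume buckets; within each bucket the
    volume is split into 'buy' and 'sell' shares via bulk volume
    classification of the price change (assumed unit variance here).
    VPIN is the rolling average of the bucket order-flow imbalance.
    """
    z = NormalDist()            # standard normal CDF for bulk classification
    buckets = []
    filled, buy = 0.0, 0.0
    last_price = prices[0]
    for p, v in zip(prices[1:], volumes[1:]):
        frac_buy = z.cdf(p - last_price)   # share classified buyer-initiated
        remaining = v
        while remaining > 0:
            take = min(remaining, bucket_volume - filled)
            buy += take * frac_buy
            filled += take
            remaining -= take
            if filled >= bucket_volume:    # bucket full: record imbalance
                buckets.append(abs(2 * buy - bucket_volume) / bucket_volume)
                filled, buy = 0.0, 0.0
        last_price = p
    out, win = [], deque(maxlen=window)
    for b in buckets:                      # rolling mean over recent buckets
        win.append(b)
        out.append(sum(win) / len(win))
    return out
```

In the paper's setup, an alert would be raised when the empirical CDF of these VPIN values exceeds a threshold such as 0.99.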

### H.-M. Eiter, M. Lavagnini, R. Hackl, E.A. Nowadnick, A.F. Kemper, T.P. Devereaux, J.-H. Chu, J.G. Analytis, I.R. Fisher, L. Degiorgi,"Alternative route to charge density wave formation in multiband systems",Proceedings of the National Academy of Sciences,2012,doi: 10.1073/pnas.1214745110

Charge and spin density waves, periodic modulations of the electron and magnetization densities, respectively, are among the most abundant and nontrivial low-temperature ordered phases in condensed matter. The ordering direction is widely believed to result from the Fermi surface topology. However, several recent studies indicate that this common view needs to be supplemented. Here, we show how an enhanced electron–lattice interaction can contribute to or even determine the selection of the ordering vector in the model charge density wave system ErTe3. Our joint experimental and theoretical study allows us to establish a relation between the selection rules of the electronic light scattering spectra and the enhanced electron–phonon coupling in the vicinity of band degeneracy points. This alternative proposal for charge density wave formation may be of general relevance for driving phase transitions into other broken-symmetry ground states, particularly in multiband systems, such as the iron-based superconductors.

### Abhinav Sarje, Xiaoye S. Li, Slim Chourou, Elaine R. Chan, Alexander Hexemer,"Massively Parallel X-ray Scattering Simulations",Supercomputing,November 2012,

Although present X-ray scattering techniques can provide tremendous information on the nano-structural properties of materials that are valuable in the design and fabrication of energy-relevant nano-devices, a primary challenge remains in the analyses of such data. In this paper we describe a high-performance, flexible, and scalable Grazing Incidence Small Angle X-ray Scattering simulation algorithm and codes that we have developed on multi-core/CPU and many-core/GPU clusters. We discuss in detail our implementation, optimization and performance on these platforms. Our results show speedups of ~125x on a Fermi-GPU and ~20x on a Cray-XE6 24-core node, compared to a sequential CPU code, with near linear scaling on multi-node clusters. To our knowledge, this is the first GISAXS simulation code that is flexible to compute scattered light intensities in all spatial directions allowing full reconstruction of GISAXS patterns for any complex structures and with high-resolutions while reducing simulation times from months to minutes.

### Mehmet Balman,"MemzNet: Memory-Mapped Zero-copy Network Channel for Moving Large Datasets over 100Gbps Networks",technical poster in ACM/IEEE international Conference For High Performance Computing, Networking, Storage and Analysis (SC'12), LBNL 6175E,November 13, 2012,doi: http://doi.ieeecomputersociety.org/10.1109/SC.Companion.2012.294

High-bandwidth networks are poised to provide new opportunities in tackling large data challenges in today's scientific applications. However, increasing the bandwidth is not sufficient by itself; we need careful evaluation of future high-bandwidth networks from the applications' perspective. We have experimented with current state-of-the-art data movement tools, and realized that file-centric data transfer protocols do not perform well with managing the transfer of many small files in high-bandwidth networks, even when using parallel streams or concurrent transfers. We require enhancements in current middleware tools to take advantage of future networking frameworks. To improve performance and efficiency, we develop an experimental prototype, called MemzNet: Memory-mapped Zero-copy Network Channel, which uses a block-based data movement method in moving large scientific datasets. We have implemented MemzNet, which takes the approach of aggregating files into blocks and providing dynamic data channel management. In this work, we present our initial results in 100Gbps networks.
http://dx.doi.org/10.1109/SC.Companion.2012.295

Tutorial session

### Daniel U. Becker, Nan Jiang, George Michelogiannakis, William J. Dally,"Adaptive Backpressure: Efficient Buffer Management for On-Chip Networks",International Conference on Computer Design,IEEE Computer Society,2012,

This paper introduces Adaptive Backpressure, a novel scheme that improves the utilization of dynamically managed router input buffers by continuously adjusting the stiffness of the flow control feedback loop in response to observed traffic conditions. Through a simple extension to the router’s flow control mechanism, the proposed scheme heuristically limits the number of credits available to individual virtual channels based on estimated downstream congestion, aiming to minimize the amount of buffer space that is occupied unproductively. This leads to more efficient distribution of buffer space and improves isolation between multiple concurrently executing workloads with differing performance characteristics.

Experimental results for a 64-node mesh network show that Adaptive Backpressure improves network stability, leading to an average 2.6× increase in throughput under heavy load across traffic patterns. In the presence of background traffic, the proposed scheme reduces zero-load latency by an average of 31%. Finally, it mitigates the performance degradation encountered when latency- and throughput-optimized execution cores contend for network resources in a heterogeneous chip multi-processor; across a set of PARSEC benchmarks, we observe an average reduction in execution time of 34%.

### Nils E. R. Zimmermann, Berend Smit, Frerich J. Keil,"Predicting Local Transport Coefficients at Solid-Gas Interfaces",J. Phys. Chem. C,2012,116:18878-1888,doi: 10.1021/jp3059855

The regular nanoporous structure makes zeolite membranes attractive candidates for separating molecules on the basis of differences in transport rates (diffusion). Since improvements in synthesis have now led to membranes as thin as several hundred nanometers, the slow transport in the boundary layer separating bulk gas and core of the nanoporous membrane is becoming increasingly important. Therefore, we investigate the predictability of the coefficient quantifying this local process, the surface permeability α, by means of a two-scale simulation approach. Methane tracer-release from the one-dimensional nanopores of an AFI-type zeolite is employed. Besides identifying a pitfall in determining α on the basis of tracer exchange, we present an accurate prediction of the surface permeability using readily available information from molecular simulations. Moreover, we show that the prediction is strongly influenced by the degree of detail with which the boundary region is modeled. It turns out that not accounting for the fact that molecules aiming to escape the host structure must indeed overcome two boundary regions yields a permeability that is too large by a factor of 1.7–3.3, depending on the temperature. Finally, our results have far-reaching implications for the design of future membrane applications.


### Richard L. Martin, Thomas F. Willems, Li-Chiang Lin, Jihan Kim, Joseph A. Swisher, Berend Smit & Maciej Haranczyk,"Similarity-Driven Discovery of Zeolite Materials for Adsorption-Based Separations",ChemPhysChem,Pages: 3561, August 22, 2012,

A tool for identifying optimum zeolite frameworks for gas separations at a fraction of the cost of molecular simulation is presented on p. 3595 ff. by M. Haranczyk et al. The method is based on identifying property-determining substructure features and searching material databases for geometrically similar arrangements of framework atoms. The approach is deployed to screen a database an order of magnitude larger than those examined in previous studies.

### Richard L. Martin, Thomas F. Willems, Li-Chiang Lin, Jihan Kim, Joseph A. Swisher, Berend Smit & Maciej Haranczyk,"Similarity-Driven Discovery of Zeolite Materials for Adsorption-Based Separations",ChemPhysChem,August 22, 2012,13:3595-3597,

Crystalline porous materials can be exploited in many applications. Discovery of materials with optimum adsorption properties typically involves expensive brute-force characterization of large sets of materials. An alternative approach based on similarity searching that enables discovery of materials with optimum adsorption for CO2 and other molecules at a fraction of the cost of brute-force characterization is demonstrated.

This work was featured on the front cover of the journal, available here: http://onlinelibrary.wiley.com/doi/10.1002/cphc.201290074/abstract

### Single Program, Multiple Data Programming for Hierarchical Computations,Amir Kamil,PhD,August 2012,

As performance gains in sequential programming have stagnated due to power constraints, parallel computing has become the primary tool for increasing performance. Parallel computing has long been used in scientific computing, and programmers of the future will likely face many of the same challenges that occur in programming large-scale machines. One such challenge is that of hierarchy: machines are built in a hierarchical fashion, with a wide range of communication costs between different parts of a machine, and applications such as divide-and-conquer algorithms often have hierarchical structure. Large-scale parallel machines are programmed primarily with the single program, multiple data (SPMD) model of parallelism. This model combines independent threads of execution with global collective communication and synchronization operations. Previous work has demonstrated the advantages of SPMD over other models: its simplicity enables productive programming and avoids many classes of parallel errors, and at the same time it is easy to implement and amenable to compiler analysis and optimization. Its local-view execution model allows programmers to take advantage of data locality, resulting in good performance and scalability on large-scale machines. However, it is a flat model that does not fit well with hierarchical machines or algorithms. In this dissertation, we introduce the recursive single program, multiple data (RSPMD) execution model. This model extends SPMD with hierarchical, structured teams, or groupings of threads. We design RSPMD extensions for the Titanium language, including a hierarchical team data structure and lexically-scoped constructs for operating over teams. We demonstrate that these extensions prevent erroneous use of teams that would result in deadlock. In addition, we present a runtime mechanism for ensuring proper use of both global collective operations and collectives over teams, eliminating more potential sources of deadlock. 
As analyzable as SPMD is, we demonstrate that RSPMD can also be analyzed precisely and efficiently. We define a hierarchical pointer analysis for determining which data a pointer can reference, as well as on which threads the referenced data may reside. We then present a series of analyses for computing the set of concurrent statements in both SPMD and RSPMD programs. We show that these analyses improve the results of multiple client analyses, including data-locality and sharing inference, race detection, and memory-model enforcement. Finally, we present application case studies demonstrating the expressiveness and performance of the RSPMD model. We show that the model enables divide-and-conquer algorithms such as sorting to be elegantly expressed, and that team collective operations increase performance of a conjugate gradient benchmark by up to a factor of two. The model also facilitates optimizations for hierarchical machines, improving scalability of a particle-in-cell application by 8x, performance of sorting by up to 40%, and execution time of a stencil code by as much as 14%.

### Mehmet Balman,"Analyzing Data Movements and Identifying Techniques for Next-generation High-bandwidth Networks",LBNL Tech Report,2012,LBNL 6177E,

High-bandwidth networks are poised to provide new opportunities in tackling large data challenges in today's scientific applications. However, increasing the bandwidth is not sufficient by itself; we need careful evaluation of future high-bandwidth networks from the applications’ perspective. We have investigated data transfer requirements of climate applications as a typical scientific example and evaluated how the scientific community can benefit from next generation high-bandwidth networks. We develop a new block-based data movement method (in contrast to the current file-based methods) to improve data movement performance and efficiency in moving large scientific datasets that contain many small files. We implemented the new block-based data movement tool, which takes the approach of aggregating files into blocks and providing dynamic data channel management. One of the major obstacles in use of high-bandwidth networks is the limitation in host system resources. We have conducted a large number of experiments with our new block-based method and with currently available file-based data movement tools. In this white paper, we describe future research problems and challenges for efficient use of next-generation science networks, based on the lessons learnt and the experiences gained with 100Gbps network applications.

### Jihan Kim, Li-Chiang Lin, Richard L. Martin, Joseph A. Swisher, Maciej Haranczyk & Berend Smit,"Large-Scale Computational Screening of Zeolites for Ethane/Ethene Separation",Langmuir,July 11, 2012,28:11914–1191,

Large-scale computational screening of thirty thousand zeolite structures was conducted to find optimal structures for separation of ethane/ethene mixtures. Efficient grand canonical Monte Carlo (GCMC) simulations were performed with graphics processing units (GPUs) to obtain pure component adsorption isotherms for both ethane and ethene. We have utilized the ideal adsorbed solution theory (IAST) to obtain the mixture isotherms, which were used to evaluate the performance of each zeolite structure based on its working capacity and selectivity. In our analysis, we have determined that specific arrangements of zeolite framework atoms create sites for the preferential adsorption of ethane over ethene. The majority of optimum separation materials can be identified by utilizing this knowledge, and screening structures for the presence of this feature will enable the efficient selection of promising candidate materials for ethane/ethene separation prior to performing molecular simulations.
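As a rough illustration of the IAST step mentioned above (not the paper's code), the sketch below solves the binary IAST equations for the special case where both pure-component isotherms are Langmuir, n(p) = q_m·b·p/(1+b·p), for which the reduced grand potential has the closed form ψ(P⁰) = q_m·ln(1+b·P⁰). All parameter values are illustrative assumptions.

```python
import math

def langmuir(p, qm, b):
    """Pure-component Langmuir loading at pressure p."""
    return qm * b * p / (1.0 + b * p)

def psi(p0, qm, b):
    """Reduced grand potential of a Langmuir isotherm (closed form)."""
    return qm * math.log(1.0 + b * p0)

def iast_binary(P, y1, iso1, iso2):
    """Return (x1, n_total) for a binary mixture at total pressure P.

    iso1/iso2 are (qm, b) Langmuir parameters; y1 is the gas-phase mole
    fraction of component 1. IAST uses the Raoult-like relation
    P*y_i = x_i * P_i^0 and equal spreading pressures psi_1 = psi_2.
    """
    y2 = 1.0 - y1
    def mismatch(x1):
        p01 = P * y1 / x1           # fictitious pure pressure of component 1
        p02 = P * y2 / (1.0 - x1)   # and of component 2
        return psi(p01, *iso1) - psi(p02, *iso2)
    lo, hi = 1e-9, 1.0 - 1e-9       # mismatch runs from +inf to -inf
    for _ in range(200):            # bisection on the adsorbed mole fraction
        mid = 0.5 * (lo + hi)
        if mismatch(lo) * mismatch(mid) <= 0:
            hi = mid
        else:
            lo = mid
    x1 = 0.5 * (lo + hi)
    p01, p02 = P * y1 / x1, P * y2 / (1.0 - x1)
    n_total = 1.0 / (x1 / langmuir(p01, *iso1)
                     + (1 - x1) / langmuir(p02, *iso2))
    return x1, n_total
```

Working capacity and selectivity metrics of the kind described in the abstract can then be computed from such mixture loadings at the adsorption and desorption conditions of interest.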

### J. K. Freericks, A. Y. Liu, A. F. Kemper, T. P. Devereaux,"Pulsed high harmonic generation of light due to pumped Bloch oscillations in noninteracting metals",Physica Scripta,2012,T151:014062,doi: 10.1088/0031-8949/2012/T151/014062

We derive a simple theory for high-order harmonic generation due to pumping a noninteracting metal with a large amplitude oscillating electric field. The model assumes that the radiated light field arises from the acceleration of electrons due to the time-varying current generated by the pump, and also assumes that the system has a constant density of photoexcited carriers, hence it ignores the dipole excitation between bands (which would create carriers in semiconductors). We examine the circumstances under which odd harmonic frequencies would be expected to dominate the spectrum of radiated light, and we also apply the model to real materials like ZnO, for which high-order harmonic generation has already been demonstrated in experiments.
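A toy numerical version of this argument (a single electron in a 1D tight-binding band, no scattering, dimensionless units, all choices ours rather than the paper's) shows the odd-harmonic structure directly: with k(t) = -A·sin(ωt), the velocity behaves as sin(A·sin(ωt)), which by the Jacobi–Anger expansion contains only odd multiples of ω.

```python
import numpy as np

# Drive amplitude A = e*E0*a/(hbar*w) in dimensionless units (illustrative).
A = 3.0
cycles, pts = 64, 4096
t = np.linspace(0.0, 2 * np.pi * cycles, pts, endpoint=False)

# Current of an electron starting at k = 0 in a cosine band under the drive.
current = np.sin(A * np.sin(t))

spec = np.abs(np.fft.rfft(current))               # spectrum of emitted light
freqs = np.fft.rfftfreq(pts, d=t[1] - t[0]) * 2 * np.pi  # in units of omega

def harmonic_power(n):
    """Spectral weight at the n-th harmonic of the drive frequency."""
    idx = np.argmin(np.abs(freqs - n))
    return spec[idx]

# Odd harmonics (1, 3, 5) dominate; even harmonics (2, 4) are suppressed.
odd = [harmonic_power(n) for n in (1, 3, 5)]
even = [harmonic_power(n) for n in (2, 4)]
```

The suppression of even harmonics here follows purely from the inversion symmetry of the band, consistent with the circumstances for odd-harmonic dominance examined in the paper.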

### Mehmet Balman, Eric Pouyoul, Yushu Yao, E. Wes Bethel, Burlen Loring, Prabhat, John Shalf, Alex Sim, and Brian L. Tierney,"Experiences with 100G Network Applications",In Proceedings of the Fifth international Workshop on Data-intensive Distributed Computing, in conjunction with ACM High Performance Distributing Computing (HPDC) Conference, 2012,Delft, Netherlands,June 2012,LBNL 5603E, doi: 10.1145/2286996.2287004

100Gbps networking has finally arrived, and many research and educational institutions have begun to deploy 100Gbps routers and services. ESnet and Internet2 worked together to make 100Gbps networks available to researchers at the Supercomputing 2011 conference in Seattle, Washington. In this paper, we describe two of the first applications to take advantage of this network. We demonstrate a visualization application that enables remotely located scientists to gain insights from large datasets. We also demonstrate climate data movement and analysis over the 100Gbps network. We describe a number of application design issues and host tuning strategies necessary for enabling applications to scale to 100Gbps rates.

### Prabhat, Oliver Rübel, Surendra Byna, Kesheng Wu, Fuyu Li, Michael Wehner and E. Wes Bethel,"TECA: A Parallel Toolkit for Extreme Climate Analysis",Procedia Computer Science, Proceedings of the International Conference on Computational Science, ICCS 2012, Presented at Third Workshop on Data Mining in Earth System Science (DMESS 2012),Omaha, Nebraska,June 2012,9:866–876,LBNL 5352E, doi: 10.1016/j.procs.2012.04.093

We present TECA, a parallel toolkit for detecting extreme events in large climate datasets. Modern climate datasets expose parallelism across a number of dimensions: spatial locations, timesteps and ensemble members. We design TECA to exploit these modes of parallelism and demonstrate a prototype implementation for detecting and tracking three classes of extreme events: tropical cyclones, extra-tropical cyclones and atmospheric rivers. We process a modern TB-sized CAM5 simulation dataset with TECA, and demonstrate good runtime performance for the three case studies.

### Energy-Efficient Flow-Control for On-Chip Networks,George Michelogiannakis,Stanford University,2012,

With the emergence of on-chip networks, the power consumed by router buffers has become a primary concern. Bufferless flow control has been proposed to address this issue by removing router buffers and handling contention by dropping or deflecting flits. In this thesis, we compare virtual-channel (buffered) and deflection (packet-switched bufferless) flow control. Our study shows that unless process constraints lead to excessively costly buffers, the performance, cost and increased complexity of deflection flow control outweigh its potential gains. To provide buffering in the network but without the cost and timing overhead of router buffers, we propose elastic buffer (EB) flow control which adds simple control logic in the channels to use pipeline flip-flops (FFs) as EBs with two storage locations. This way, channels act as distributed FIFOs, and neither input buffers nor the complexity of virtual channels (VCs) is required. Therefore, EB networks have a shorter cycle time and offer more throughput per unit power than VC networks. We also propose a hybrid EB-VC router which is used to provide traffic separation for a number of traffic classes large enough for duplicate physical channels to be inefficient. These hybrid routers offer more throughput per unit power than both EB and VC routers. Finally, this thesis proposes packet chaining, which addresses the tradeoff between allocation quality and cycle time traditionally present in routers with VCs. Packet chaining is a simple and effective method to increase allocator matching efficiency to be comparable or superior to more complex and slower allocators without extending cycle time, particularly suited to networks with short packets.

### "New materials could slash energy costs for CO2 capture",Jade Boyd, David Ruth,May 30, 2012,

A detailed analysis of more than 4 million absorbent minerals has determined that new materials could help electricity producers slash as much as 30 percent of the “parasitic energy” costs associated with removing carbon dioxide from power plant emissions...

### "Computer model pinpoints prime materials for efficient carbon capture",Robert Sanders,May 27, 2012,

When power plants begin capturing their carbon emissions to reduce greenhouse gases – and to most in the electric power industry, it’s a question of when, not if – it will be an expensive undertaking...

### Li-Chiang Lin, Adam H. Berger, Richard L. Martin, Jihan Kim, Joseph A. Swisher, Kuldeep Jariwala, Chris H. Rycroft, Abhoyjit S. Bhown, Michael W. Deem, Maciej Haranczyk & Berend Smit,"In Silico Screening of Carbon Capture Materials",Nature Materials,May 27, 2012,11:633–641,

One of the main bottlenecks to deploying large-scale carbon dioxide capture and storage (CCS) in power plants is the energy required to separate the CO2 from flue gas. For example, near-term CCS technology applied to coal-fired power plants is projected to reduce the net output of the plant by some 30% and to increase the cost of electricity by 60–80%. Developing capture materials and processes that reduce the parasitic energy imposed by CCS is therefore an important area of research. We have developed a computational approach to rank adsorbents for their performance in CCS. Using this analysis, we have screened hundreds of thousands of zeolite and zeolitic imidazolate framework structures and identified many different structures that have the potential to reduce the parasitic energy of CCS by 30–40% compared with near-term technologies.

### Mahmoud K. F. Abouelnasr & Berend Smit,"Diffusion in confinement: kinetic simulations of self- and collective diffusion behavior of adsorbed gases",Physical Chemistry Chemical Physics,Pages: 11559, May 18, 2012,

The relationship between the self- and collective-diffusion behavior of adsorbed gases is investigated with various simulations.

### Abhinav Sarje, Jack Pien, Xiaoye S. Li, Elaine Chan, Slim Chourou, Alexander Hexemer, Arthur Scholz, Edward Kramer,"Large-scale Nanostructure Simulations from X-ray Scattering Data On Graphics Processor Clusters",LBNL Tech Report,May 15, 2012,LBNL LBNL-5351E,

X-ray scattering is a valuable tool for measuring the structural properties of materials used in the design and fabrication of energy-relevant nanodevices (e.g., photovoltaic, energy storage, battery, fuel, and carbon capture and sequestration devices) that are key to the reduction of carbon emissions. Although today's ultra-fast X-ray scattering detectors can provide tremendous information on the structural properties of materials, a primary challenge remains in the analyses of the resulting data. We are developing novel high-performance computing algorithms, codes, and software tools for the analyses of X-ray scattering data. In this paper we describe two such HPC algorithm advances. Firstly, we have implemented a flexible and highly efficient Grazing Incidence Small Angle Scattering (GISAXS) simulation code based on the Distorted Wave Born Approximation (DWBA) theory with C++/CUDA/MPI on a cluster of GPUs. Our code can compute the scattered light intensity from any given sample in all directions of space, thus allowing full construction of the GISAXS pattern. Preliminary tests on a single GPU show speedups over 125x compared to the sequential code, and almost linear speedup when executing across a GPU cluster with 42 nodes, resulting in an additional 40x speedup compared to using one GPU node. Secondly, for the structural fitting problems in inverse modeling, we have implemented a Reverse Monte Carlo simulation algorithm with C++/CUDA using one GPU. Since there are large numbers of fitting parameters in the X-ray scattering simulation model, the earlier single-CPU code required weeks of runtime. Deploying the AccelerEyes Jacket/Matlab wrapper to use the GPU gave around 100x speedup over the pure CPU code. Our further C++/CUDA optimization delivered an additional 9x speedup.

### W.S. Lee, Y.D. Chuang, R.G. Moore, Y. Zhu, L. Patthey, M. Trigo, D.H. Lu, P.S. Kirchmann, O. Krupin, M. Yi, M. Langner, N. Huse, J.S. Robinson, Y. Chen, S.Y. Zhou, G. Coslovich, B. Huber, D.A. Reis, R.A. Kaindl, R.W. Schoenlein, D. Doering, P. Denes, W.F. Schlotter, J.J. Turner, S.L. Johnson, M. Först, T. Sasagawa, Y.F. Kung, A.P. Sorini, A.F. Kemper, B. Moritz, T.P. Devereaux, D.-H. Lee, Z.X. Shen & Z. Hussain,"Phase fluctuations and the absence of topological defects in a photo-excited charge-ordered nickelate",Nature Communications 3, Article number: 838,May 15, 2012,

The dynamics of an order parameter's amplitude and phase determines the collective behaviour of novel states emerging in complex materials. Time- and momentum-resolved pump-probe spectroscopy, by virtue of measuring material properties at atomic and electronic time scales out of equilibrium, can decouple entangled degrees of freedom by visualizing their corresponding dynamics in the time domain. Here we combine time-resolved femtosecond optical and resonant X-ray diffraction measurements on charge ordered La1.75Sr0.25NiO4 to reveal unforeseen photoinduced phase fluctuations of the charge order parameter. Such fluctuations preserve long-range order without creating topological defects, distinct from thermal phase fluctuations near the critical temperature in equilibrium. Importantly, relaxation of the phase fluctuations is found to be an order of magnitude slower than that of the order parameter's amplitude fluctuations, and thus limits charge order recovery. This new aspect of phase fluctuations provides a more holistic view of the phase's importance in ordering phenomena of quantum matter.

### Fuyu Li, Daniele Rosa, William D. Collins, and Michael F. Wehner,"“Super-parameterization”: A better way to simulate regional extreme precipitation?",Journal of Advances in Modeling Earth Systems,April 4, 2012,4, doi: 10.1029/2011MS000106

Extreme precipitation is generally underestimated by current climate models relative to observations of present-day rainfall distributions. Possible causes of this systematic error include the convective parameterization in these models that have been designed to reproduce measurements of climatological mean precipitation. One possible approach to improve the interaction of subgrid-scale physical processes and large-scale climate is to replace the conventional convective parameterizations with a high-resolution cloud-system resolving model. A “super-parameterized” Community Atmosphere Model (SP-CAM) utilizing this approach is used in this study to investigate the distribution of extreme precipitation in the United States. Results show that SP-CAM better simulates the distributions of both light and intense precipitation compared to the standard version of CAM based upon conventional parameterizations. The improvements are mostly seen in regions dominated by convective precipitation, suggesting that super-parameterization provides a better representation of subgrid convective processes.

### Erjun Kan, Wei Hu, Chuanyun Xiao, Ruifeng Lu, Kaiming Deng, Jinlong Yang and Haibin Su,"Half-Metallicity in Organic Single Porous Sheets",J. Am. Chem. Soc., 2012, 134 (13), 5718–5721,March 22, 2012,doi: 10.1021/ja210822c

The unprecedented applications of two-dimensional (2D) atomic sheets in spintronics are formidably hindered by the lack of ordered spin structures. Here we present first-principles calculations demonstrating that the recently synthesized dimethylmethylene-bridged triphenylamine (DTPA) porous sheet is a ferromagnetic half-metal and that the size of the band gap in the semiconducting channel is roughly 1 eV, which makes the DTPA sheet an ideal candidate for a spin-selective conductor. In addition, the robust half-metallicity of the 2D DTPA sheet under external strain increases the possibility of applications in nanoelectric devices. In view of the most recent experimental progress on controlled synthesis, organic porous sheets pave a practical way to achieve new spintronics.

### Jihan Kim, Richard L. Martin, Oliver Rübel, Maciej Haranczyk & Berend Smit,"High-throughput Characterization of Porous Materials Using Graphics Processing Units",Journal of Chemical Theory and Computation,March 16, 2012,8:1684–1693,LBNL 5409E, doi: 10.1021/ct200787v

We have developed a high-throughput graphics processing unit (GPU) code that can characterize a large database of crystalline porous materials. In our algorithm, the GPU is utilized to accelerate energy grid calculations, where the grid values represent interactions (i.e., Lennard-Jones + Coulomb potentials) between gas molecules (i.e., CH4 and CO2) and materials’ framework atoms. Using a parallel flood fill central processing unit (CPU) algorithm, inaccessible regions inside the framework structures are identified and blocked, based on their energy profiles. Finally, we compute the Henry coefficients and heats of adsorption through statistical Widom insertion Monte Carlo moves in the domain restricted to the accessible space. The code offers significant speedup over a single core CPU code and allows us to characterize a set of porous materials at least an order of magnitude larger than those considered in earlier studies. For structures selected from such a prescreening algorithm, full adsorption isotherms can be calculated by conducting multiple Grand Canonical Monte Carlo (GCMC) simulations concurrently within the GPU.
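As a drastically simplified sketch of the Widom-insertion step described above (not the paper's GPU grid code): a Henry coefficient can be estimated by averaging the Boltzmann factor of a probe molecule inserted at random points of the framework. Here the "framework" is a single fictitious Lennard-Jones site, there is no Coulomb term, and no accessibility blocking is performed; all values are illustrative assumptions.

```python
import math
import random

def lj(r2, eps=1.0, sigma=1.0):
    """Lennard-Jones pair energy as a function of squared distance."""
    s6 = (sigma * sigma / r2) ** 3
    return 4.0 * eps * (s6 * s6 - s6)

def henry_coefficient(n_insertions=200_000, box=6.0, beta=1.0, seed=7):
    """Widom-insertion estimate of K_H = beta * <exp(-beta * U)>.

    U is the energy of a ghost probe inserted at a uniformly random
    position; here the host is a single LJ site at the box center.
    """
    rng = random.Random(seed)
    host = (box / 2, box / 2, box / 2)
    acc = 0.0
    for _ in range(n_insertions):
        x, y, z = (rng.uniform(0, box) for _ in range(3))
        r2 = (x - host[0]) ** 2 + (y - host[1]) ** 2 + (z - host[2]) ** 2
        u = lj(max(r2, 1e-6))        # clamp r2 to avoid division by zero
        acc += math.exp(-beta * u)   # huge repulsive u underflows to 0
    return beta * acc / n_insertions
```

The paper's prescreening replaces the on-the-fly energy evaluation with a precomputed GPU energy grid and restricts the average to the accessible region identified by flood fill, but the statistical estimator is of this Widom form.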

### Xiaodan Gu, Zuwei Liu, Ilja Gunkel, Slim Chourou, Sung Woo Hong, Deirdre Olynick, Thomas P. Russell,"High Aspect Ratio Sub-15nm Silicon Trenches From Block Copolymer Templates",Advanced Materials,2012,24:5688,

High-aspect-ratio sub-15-nm silicon trenches are fabricated directly from plasma etching of a block copolymer mask. A novel method that combines a block copolymer reconstruction process and reactive ion etching is used to make the polymer mask. Silicon trenches are characterized by various methods and used as a master for subsequent imprinting of different materials. Silicon nanoholes are generated from a block copolymer with cylindrical microdomains oriented normal to the surface.

### Eliot Gann, Slim Chourou, Abhinav Sarje, Harald Ade, Cheng Wang, Elaine Chan, Xiaodong Ding, Alexander Hexemer,"An Interactive 3D Interface to Model Complex Surfaces and Simulate Grazing Incidence X-ray Scatter Patterns",American Physical Society March Meeting 2012,March 2012,

Grazing incidence scattering is becoming critical in the characterization of the ensemble statistical properties of complex layered and nanostructured thin-film systems over length scales of centimeters. A major bottleneck in the widespread implementation of these techniques is the quantitative interpretation of the complicated grazing incidence scattering patterns. To fill this gap, we present the development of a new interactive program to model complex nanostructured and layered systems for efficient grazing incidence scattering calculations.

### Andrew Canning, Slim Chourou, Stephen Derenzo,"First-principles studies of Ce and Eu doped inorganic materials as candidates for scintillator gamma ray detectors",American Physical Society March Meeting 2012,February 2012,57,

We have performed high-throughput DFT based (GGA+U) band structure calculations for new Ce and Eu doped wide band gap inorganic materials to determine their potential as candidates for gamma ray scintillator detectors. These calculations are based on determining the 4f ground state level of the Ce and Eu relative to the valence band of the host as well as the position of the Ce and Eu 5d excited state relative to the conduction band of the host. We find many classes of candidate materials where the 5d is in the conduction band preventing scintillation. Even when the Eu and Ce 4f and 5d levels are placed well in the gap of the host, traps on the host can also prevent the energy of the gamma ray transferring to the Eu or Ce. We therefore also performed calculations for host hole traps and electron traps to compare their energies to the Ce and Eu 4f and 5d levels.

### S. Chourou, A. Sarje, X. Li, E. Chan, A. Hexemer,"GISAXS simulation and analysis on GPU clusters",American Physical Society March Meeting 2012,February 2012,

We have implemented a flexible Grazing Incidence Small-Angle Scattering (GISAXS) simulation code based on the Distorted Wave Born Approximation (DWBA) theory that effectively utilizes the parallel processing power provided by GPUs. This constitutes a handy tool for experimentalists facing a massive flux of data, allowing them to accurately simulate the GISAXS process and analyze the produced data. The software computes the diffraction image for any given superposition of custom shapes or morphologies (e.g. obtained graphically via a discretization scheme) in a user-defined region of k-space (or region of the area detector) for all possible grazing incidence angles and in-plane sample rotations. This flexibility makes it easy to tackle a wide range of possible sample geometries, such as nanostructures on top of or embedded in a substrate or a multilayered structure. In cases where the sample displays regions of significant refractive index contrast, an algorithm has been implemented to perform an optimal slicing of the sample along the vertical direction and compute the averaged refractive index profile to be used as the reference geometry of the unperturbed system. Preliminary tests on a single GPU show a speedup of over 200 times compared to the sequential code.
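
For intuition only, the sketch below computes scattered intensities in the plain Born approximation, I(q) = |Σ_j exp(iq·r_j)|², for a discretized shape; the full DWBA treatment of the paper adds reflection and refraction terms that are omitted here, and the shape and q-grid below are made up.

```python
import numpy as np

# Minimal Born-approximation sketch (not the paper's full DWBA): the
# scattered intensity from a discretized shape is |sum_j exp(i q.r_j)|^2.

def intensity(points, q_vectors):
    """points: (N, 3) scatterer positions; q_vectors: (M, 3) momentum
    transfers. Returns the (M,) array of intensities."""
    phases = np.exp(1j * q_vectors @ points.T)  # (M, N) phase factors
    form_factor = phases.sum(axis=1)            # coherent sum over scatterers
    return np.abs(form_factor) ** 2

# Cube of 5x5x5 point scatterers, probed along q_z
side = np.arange(5.0)
pts = np.array([[x, y, z] for x in side for y in side for z in side])
qz = np.linspace(0.0, 2.0, 50)
q = np.stack([np.zeros_like(qz), np.zeros_like(qz), qz], axis=1)
I = intensity(pts, q)
print(I[0])  # q = 0 gives N^2 = 125^2 = 15625
```

The per-q sums are independent, which is what makes this computation map so naturally onto GPU threads.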

### Richard L. Martin, Prabhat, David D. Donofrio, James A. Sethian & Maciej Haranczyk,"Accelerating Analysis of void spaces in porous materials on multicore and GPU platforms",International Journal of High Performance Computing Applications,February 5, 2012,26:347-357,

Developing computational tools that enable discovery of new materials for energy-related applications is a challenge. Crystalline porous materials are a promising class of materials that can be used for oil refinement, hydrogen or methane storage as well as carbon dioxide capture. Selecting optimal materials for these important applications requires analysis and screening of millions of potential candidates. Recently, we proposed an automatic approach based on the Fast Marching Method (FMM) for performing analysis of void space inside materials, a critical step preceding expensive molecular dynamics simulations. This breakthrough enables unsupervised, high-throughput characterization of large material databases. The algorithm has three steps: (1) calculation of the cost-grid which represents the structure and encodes the occupiable positions within the void space; (2) using FMM to segment out patches of the void space in the grid of (1), and find how they are connected to form either periodic channels or inaccessible pockets; and (3) generating blocking spheres that encapsulate the discovered inaccessible pockets and are used in subsequent molecular simulations. In this work, we expand upon our original approach through (A) replacement of the FMM-based approach with a more computationally efficient flood fill algorithm; and (B) parallelization of all steps in the algorithm, including a GPU implementation of the most computationally expensive step, the cost-grid generation. We report the acceleration achievable in each step and in the complete application, and discuss the implications for high-throughput material screening.
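
The flood-fill replacement of step (A) can be sketched as a breadth-first traversal over a periodic occupancy grid; the grid below is invented, and the later classification of regions into periodic channels versus inaccessible pockets (and the blocking-sphere generation) is left out.

```python
from collections import deque

# Sketch of flood-fill segmentation of void space on a periodic grid.
# 1 = accessible, 0 = blocked (values here are hypothetical).

def label_regions(grid):
    """Assign a positive label to each connected accessible region,
    honoring periodic boundary conditions in both directions."""
    nx, ny = len(grid), len(grid[0])
    labels = [[0] * ny for _ in range(nx)]
    current = 0
    for sx in range(nx):
        for sy in range(ny):
            if grid[sx][sy] == 1 and labels[sx][sy] == 0:
                current += 1
                labels[sx][sy] = current
                queue = deque([(sx, sy)])
                while queue:
                    x, y = queue.popleft()
                    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        px, py = (x + dx) % nx, (y + dy) % ny  # periodic wrap
                        if grid[px][py] == 1 and labels[px][py] == 0:
                            labels[px][py] = current
                            queue.append((px, py))
    return current, labels

grid = [[1, 0, 1, 0],
        [1, 0, 1, 0],
        [0, 0, 0, 0],
        [1, 0, 0, 0]]
n, lab = label_regions(grid)
print(n)  # → 2 (the first column wraps into one region; the third is another)
```

Each cell is visited a constant number of times, which is the source of the efficiency gain over solving the FMM's eikonal update at every grid point.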

### Nan Jiang, Daniel U. Becker, George Michelogiannakis, William J. Dally,"Network Congestion Avoidance through Speculative Reservation",International Symposium on High Performance Computer Architecture,IEEE Computer Society,2012,

Congestion caused by hot-spot traffic can significantly degrade the performance of a computer network. In this study, we present the Speculative Reservation Protocol (SRP), a new network congestion control mechanism that relieves the effect of hot-spot traffic in high bandwidth, low latency, lossless computer networks. Compared to existing congestion control approaches like Explicit Congestion Notification (ECN), which react to network congestion through packet marking and rate throttling, SRP takes a proactive approach of congestion avoidance. Using a light-weight endpoint reservation scheme and speculative packet transmission, SRP avoids hot-spot congestion while incurring minimal overhead. Our simulation results show that SRP responds more rapidly to the onset of severe hot-spots than ECN and has a higher network throughput on bursty network traffic. SRP also performs comparably to networks without congestion control on benign traffic patterns by reducing the latency and throughput overhead commonly associated with reservation protocols.

### Nils E. R. Zimmermann, Sayee P. Balaji, Frerich J. Keil,"Surface Barriers of Hydrocarbon Transport Triggered by Ideal Zeolite Structures",J. Phys. Chem. C,2012,116:3677-3683,doi: 10.1021/jp2112389

Shedding light on the nature of surface barriers of nanoporous materials, molecular simulations (Monte Carlo, Reactive Flux) have been employed to investigate the tracer-exchange characteristics of hydrocarbons in defect-free single-crystal zeolite membranes. The concept of a critical membrane thickness as a quantitative measure of surface barriers is shown to be appropriate and advantageous. Nanopore smoothness, framework density, and thermodynamic state of the fluid phase have been identified as the most important influencing variables of surface barriers. Despite the ideal character of the adsorbent, our simulation results clearly support current experimental findings on MOF Zn(tbip) where a larger number of crystal defects caused exceptionally strong surface barriers. Most significantly, our study predicts that the ideal crystal structure without any such defects will already be a critical aspect of experimental analysis and process design in many cases of the upcoming class of extremely thin and highly oriented nanoporous membranes.

Watch a movie here that highlights how n-hexane molecules are adsorbed in a zeolite slab.

### Brian Van Straalen, David Trebotich, Terry Ligocki, Daniel T. Graves, Phillip Colella, Michael Barad,"An Adaptive Cartesian Grid Embedded Boundary Method for the Incompressible Navier Stokes Equations in Complex Geometry",LBNL Report Number: LBNL-1003767,2012,

We present a second-order accurate projection method to solve the incompressible Navier-Stokes equations on irregular domains in two and three dimensions. We use a finite-volume discretization obtained from intersecting the irregular domain boundary with a Cartesian grid. We address the small-cell stability problem associated with such methods by hybridizing a conservative discretization of the advective terms with a stable, nonconservative discretization at irregular control volumes, and redistributing the difference to nearby cells. Our projection is based upon a finite-volume discretization of Poisson's equation. We use a second-order, $L^\infty$-stable algorithm to advance in time. Block structured local refinement is applied in space. The resulting method is second-order accurate in $L^1$ for smooth problems. We demonstrate the method on benchmark problems for flow past a cylinder in 2D and a sphere in 3D as well as flows in 3D geometries obtained from image data.

### Michael F. Wehner,"Methods of Projecting Future Changes in Extremes",Extremes in a Changing Climate: Detection, Analysis and Uncertainty,edited by A. AghaKouchak et al., (Springer:2012) doi: 10.1007/978-94-007-4479-0_8

This chapter examines some selected methods of projecting changes in extreme weather and climate statistics. Indices of extreme temperature and precipitation provide measures of moderately rare weather events that are straightforward to calculate. Drought indices provide measures of both agricultural and hydrological drought that are especially suitable for constructing multi-model ensemble projections of future change. Extreme value statistical theories are surveyed and provide methodologies for projecting the changes in frequency and severity of very rare temperature and precipitation events.

Future changes in the average climate virtually guarantee that changes in extreme weather events will follow. Such rare events are best described statistically as it is difficult, but perhaps not impossible, to directly link individual disasters to human-induced climate change. Examples of extreme weather events with severe consequences to society that are amenable to projection include heat waves, cold spells, floods, droughts and tropical cyclones. Confidence in projections of future changes in the severity and frequency of such events is increased if the mechanisms of change can be identified and understood. Equally important, however, is the rigorous quantification of the uncertainties in these projections. These uncertainties include the inherent natural variability of the climate system as well as limitations in both the climate models’ fidelity and the statistical methods used to analyze their output.

The discussions about future changes in extreme events in recent climate change assessment reports (including the IPCC 4th Assessment Report and the US national assessments) did not generally focus on sophisticated statistical analyses. Rather, extremes were presented in these documents by a series of “extreme indices”. First introduced by Frich et al. (2002), they are often referred to as the Frich indices. While many of these represent significant departures from the mean climate, they are by no means descriptive of rare events or the far tails of the temperature or precipitation distributions. The fundamental difference between these index-based treatments and formal Extreme Value Theory descriptions of rare events illustrates the difficulties in nomenclature when discussing climate and weather extremes. What constitutes “extreme” varies greatly in the literature and depends highly on the application of the final results. This chapter will survey some of these methods of projecting changes in climate and weather.
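
As a minimal illustration of the extreme value methods the chapter surveys, the sketch below fits a Gumbel (GEV type I) distribution to synthetic annual maxima by the method of moments and computes a 100-year return level; the data are invented, and a real analysis would typically use maximum likelihood on observations or model output.

```python
import math
import random

# Method-of-moments Gumbel fit to block (annual) maxima, plus a return
# level. The Gumbel scale relates to the standard deviation by
# sigma = scale * pi / sqrt(6), and the mean exceeds loc by
# EULER_GAMMA * scale.

EULER_GAMMA = 0.5772156649015329

def gumbel_fit(maxima):
    n = len(maxima)
    mean = sum(maxima) / n
    var = sum((x - mean) ** 2 for x in maxima) / (n - 1)
    scale = math.sqrt(6.0 * var) / math.pi
    loc = mean - EULER_GAMMA * scale
    return loc, scale

def return_level(loc, scale, T):
    """Value exceeded on average once every T blocks (e.g. years)."""
    return loc - scale * math.log(-math.log(1.0 - 1.0 / T))

rng = random.Random(42)
# Synthetic annual maxima drawn from a true Gumbel(loc=30, scale=5)
maxima = [30.0 - 5.0 * math.log(-math.log(rng.random())) for _ in range(200)]
loc, scale = gumbel_fit(maxima)
print(round(return_level(loc, scale, 100), 1))
```

The true 100-year return level for Gumbel(30, 5) is about 53; the moment estimate recovers it to within sampling noise.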

### Alexandre J. Chorin, Xuemin Tu,"An iterative implementation of the implicit nonlinear filter",ESAIM: Mathematical Modelling and Numerical Analysis,2012,46:535--543,

Implicit sampling is a sampling scheme for particle filters, designed to move particles one-by-one so that they remain in high-probability domains. We present a new derivation of implicit sampling, as well as a new iteration method for solving the resulting algebraic equations.

### T.C. Peterson, R. Heim, R. Hirsch, D. Kaiser, H. Brooks, N.S. Diffenbaugh, R. Dole, J. Giovannettone, K. Guiguis, T.R. Karl, R.W. Katz, K. Kunkel, D. Lettenmaier, G. J. McCabe, C.J. Paciorek, K. Ryberg, S. Schubert, V.B.S. Silva, B. Stewart, A.V. Vecchia, G. Villarini, R.S. Vose, J. Walsh, M. Wehner, D. Wolock, K. Wolter, C.A. Woodhouse and D. Wuebbles,"Monitoring and Understanding Trends in Extreme Storms: State of Knowledge",Bulletin of the American Meteorological Society,2012,doi: 10.1175/BAMS-D-11-00262.1

The state of knowledge regarding trends and an understanding of their causes is presented for a specific subset of extreme weather and climate types. For severe convective storms (tornadoes, hail storms, and severe thunderstorms), differences in time and space of practices of collecting reports of events make the use of the reporting database to detect trends extremely difficult. Overall, changes in the frequency of environments favorable for severe thunderstorms have not been statistically significant. For extreme precipitation, there is strong evidence for a nationally-averaged upward trend in the frequency and intensity of events. The causes of the observed trends have not been determined with certainty, although there is evidence that increasing atmospheric water vapor may be one factor. For hurricanes and typhoons, robust detection of trends in Atlantic and western North Pacific tropical cyclone (TC) activity is significantly constrained by data heterogeneity and deficient quantification of internal variability. Attribution of past TC changes is further challenged by a lack of consensus on the physical linkages between climate forcing and TC activity. As a result, attribution of trends to anthropogenic forcing remains controversial. For severe snowstorms and ice storms, the number of severe regional snowstorms that occurred since 1960 was more than twice that of the preceding 60 years. There are no significant multi-decadal trends in the areal percentage of the contiguous U.S. impacted by extreme seasonal snowfall amounts since 1900. There is no distinguishable trend in the frequency of ice storms for the U.S. as a whole since 1950.

### Jon Wilkening, Jia Yu,"Overdetermined shooting methods for computing standing water waves with spectral accuracy",Computational Science & Discovery,2012,Pages: 014017,

A high-performance shooting algorithm is developed to compute time-periodic solutions of the free-surface Euler equations with spectral accuracy in double and quadruple precision. The method is used to study resonance and its effect on standing water waves. We identify new nucleation mechanisms in which isolated large-amplitude solutions, and closed loops of such solutions, suddenly exist for depths below a critical threshold. We also study degenerate and secondary bifurcations related to Wilton's ripples in the traveling case, and explore the breakdown of self-similarity at the crests of extreme standing waves. In shallow water, we find that standing waves take the form of counter-propagating solitary waves that repeatedly collide quasi-elastically. In deep water with surface tension, we find that standing waves resemble counter-propagating depression waves. We also discuss the existence and non-uniqueness of solutions, and smooth versus erratic dependence of Fourier modes on wave amplitude and fluid depth. In the numerical method, robustness is achieved by posing the problem as an overdetermined nonlinear system and using either adjoint-based minimization techniques or a quadratically convergent trust-region method to minimize the objective function. Efficiency is achieved in the trust-region approach by parallelizing the Jacobian computation, so the setup cost of computing the Dirichlet-to-Neumann operator in the variational equation is not repeated for each column. Updates of the Jacobian are also delayed until the previous Jacobian ceases to be useful. Accuracy is maintained using spectral collocation with optional mesh refinement in space, a high-order Runge–Kutta or spectral deferred correction method in time and quadruple precision for improved navigation of delicate regions of parameter space as well as validation of double-precision results. 
Implementation issues for transferring much of the computation to graphics processing units are briefly discussed, and the performance of the algorithm is tested on a number of hardware configurations.
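
The shooting idea can be illustrated on a far simpler system than the free-surface Euler equations: below, RK4 integration plus bisection find the pendulum amplitude whose turning point lands exactly at a target half-period, a scalar analogue of driving the time-T residual of a periodic orbit to zero. All parameters are illustrative, and bisection stands in for the paper's adjoint and trust-region minimization machinery.

```python
import math

def residual(a, t_end=3.5, dt=2e-3):
    """theta'(t_end) for a pendulum theta'' = -sin(theta) released from
    rest at angle a; the residual is zero exactly when t_end is a turning
    point, i.e. half the period of the orbit."""
    theta, omega = a, 0.0
    for _ in range(int(round(t_end / dt))):
        # classical RK4 step
        k1t, k1o = omega, -math.sin(theta)
        k2t, k2o = omega + 0.5 * dt * k1o, -math.sin(theta + 0.5 * dt * k1t)
        k3t, k3o = omega + 0.5 * dt * k2o, -math.sin(theta + 0.5 * dt * k2t)
        k4t, k4o = omega + dt * k3o, -math.sin(theta + dt * k3t)
        theta += dt * (k1t + 2 * k2t + 2 * k3t + k4t) / 6.0
        omega += dt * (k1o + 2 * k2o + 2 * k3o + k4o) / 6.0
    return omega

# Bisect the shooting residual over a bracketing amplitude interval
lo, hi = 0.5, 2.0
flo = residual(lo)
for _ in range(40):
    mid = 0.5 * (lo + hi)
    fmid = residual(mid)
    if flo * fmid <= 0.0:
        hi = mid
    else:
        lo, flo = mid, fmid
print(round(0.5 * (lo + hi), 3))
```

For the pendulum the answer can be checked against the exact period 4K(sin(a/2)): a half-period of 3.5 corresponds to an amplitude near 1.285 rad.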

### George Michelogiannakis, Nan Jiang, Daniel U. Becker, William J. Dally,"Packet Chaining: Efficient Single-Cycle Allocation for On-Chip networks",International Symposium on Microarchitecture,ACM,2011,

This paper introduces packet chaining, a simple and effective method to increase allocator matching efficiency and hence network performance, particularly suited to networks with short packets and short cycle times. Packet chaining operates by chaining packets destined to the same output together, to reuse the switch connection of a departing packet. This allows an allocator to build up an efficient matching over a number of cycles like incremental allocation, but not limited by packet length. For a 64-node 2D mesh at maximum injection rate and with single-flit packets, packet chaining increases network throughput by 15% compared to a highly-tuned router using a conventional single-iteration separable iSLIP allocator, and outperforms significantly more complex allocators. Specifically, it outperforms multiple-iteration iSLIP allocators and wavefront allocators by 10% and 6% respectively, and gives comparable throughput with an augmenting paths allocator. Packet chaining achieves this performance with a cycle time comparable to a single-iteration separable allocator. Packet chaining also reduces average network latency by 22.5% compared to a single-iteration iSLIP allocator. Finally, packet chaining increases IPC up to 46% (16% average) for application benchmarks because short packets are critical in a typical cache-coherent chip multiprocessor.

### Richard L. Martin, Berend Smit & Maciej Haranczyk,"Addressing Challenges of Identifying Geometrically Diverse Sets of Crystalline Porous Materials",Journal of Chemical Information and Modeling,November 18, 2011,52:308–318,

Crystalline porous materials have a variety of uses, such as for catalysis and separations. Identifying suitable materials for a given application can, in principle, be done by screening material databases. Such a screening requires automated high-throughput analysis tools that calculate topological and geometrical parameters describing pores. These descriptors can be used to compare, select, group, and classify materials. Here, we present a descriptor that captures shape and geometry characteristics of pores. Together with proposed similarity measures, it can be used to perform diversity selection on a set of porous materials. Our representations are histogram encodings of the probe-accessible fragment of the Voronoi network representing the void space of a material. We discuss and demonstrate the application of our approach on the International Zeolite Association (IZA) database of zeolite frameworks and the Deem database of hypothetical zeolites, as well as zeolitic imidazolate frameworks constructed from IZA zeolite structures. The diverse structures retrieved by our method are complementary to those expected by emphasizing diversity in existing one-dimensional descriptors, e.g., surface area, and similar to those obtainable by a (subjective) manual selection based on materials’ visual representations. Our technique allows for reduction of large sets of structures and thus enables the material researcher to focus efforts on maximally dissimilar structures.
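
The histogram-plus-similarity idea can be sketched as follows; the per-material pore-size samples below are invented, and a simple L1 distance with greedy max-min selection stands in for the paper's Voronoi-network-derived descriptors and similarity measures.

```python
# Histogram encoding, pairwise distance, and greedy diversity selection
# on made-up pore-size samples (the real descriptors encode the
# probe-accessible fragment of the Voronoi network).

def histogram(values, bins=8, lo=0.0, hi=8.0):
    h = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        h[min(int((v - lo) / width), bins - 1)] += 1
    total = sum(h)
    return [c / total for c in h]

def l1_distance(h1, h2):
    return sum(abs(a - b) for a, b in zip(h1, h2))

def maxmin_select(hists, k):
    """Greedy max-min diversity: repeatedly add the structure farthest
    from everything already chosen."""
    chosen = [0]
    while len(chosen) < k:
        best = max((i for i in range(len(hists)) if i not in chosen),
                   key=lambda i: min(l1_distance(hists[i], hists[j])
                                     for j in chosen))
        chosen.append(best)
    return chosen

materials = {
    "A": [1.1, 1.3, 1.2, 1.4],  # small pores only
    "B": [1.0, 1.2, 6.8, 7.1],  # bimodal
    "C": [6.9, 7.0, 7.2, 6.7],  # large pores only
    "D": [1.2, 1.1, 1.3, 1.5],  # near-duplicate of A
}
names = list(materials)
hists = [histogram(v) for v in materials.values()]
picked = [names[i] for i in maxmin_select(hists, 3)]
print(picked)  # → ['A', 'C', 'B']; the near-duplicate D is never chosen
```

Note how the near-duplicate structure is excluded automatically, which is exactly the behavior wanted when reducing a large database to a maximally dissimilar subset.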

### Suren Byna, Prabhat, Michael F. Wehner and Kesheng Wu,"Detecting Atmospheric Rivers in Large Climate Datasets",Proceedings of the 2nd International Workshop on Petascale Data Analytics: Challenges, and Opportunities (PDAC-11 / Supercomputing11 / ACM/IEEE),Seattle, Washington,November 14, 2011,doi: 10.1145/2110205.2110208

Extreme precipitation events on the western coast of North America are often traced to an unusual weather phenomenon known as atmospheric rivers. Although these storms may provide a significant fraction of the total water to the highly managed western US hydrological system, the resulting intense weather poses severe risks to the human and natural infrastructure through severe flooding and wind damage. To aid the understanding of this phenomenon, we have developed an efficient detection algorithm suitable for analyzing large amounts of data. In addition to detecting actual events in the recent observed historical record, this detection algorithm can be applied to global climate model output providing a new model validation methodology. Comparing the statistical behavior of simulated atmospheric river events in models to observations will enhance confidence in projections of future extreme storms. Our detection algorithm is based on a thresholding condition on the total column integrated water vapor established by Ralph et al. (2004) followed by a connected component labeling procedure to group the mesh points into connected regions in space. We develop an efficient parallel implementation of the algorithm and demonstrate good weak and strong scaling. We process a 30-year simulation output on 10,000 cores in under 3 seconds.
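
The two-stage detection can be sketched as a threshold on the integrated water vapor field followed by connected-component labeling; the field values below are made up, and the real implementation is parallel and operates on global model output rather than a toy grid.

```python
from collections import deque

# Toy version of the detection pipeline: threshold the integrated water
# vapor field (Ralph et al. use roughly 20 mm) and group exceeding grid
# points with connected-component labeling.

IWV_THRESHOLD = 20.0  # mm, following the Ralph et al. (2004) condition

def detect_regions(field):
    """Return the connected regions of grid points above threshold."""
    ny, nx = len(field), len(field[0])
    seen = [[False] * nx for _ in range(ny)]
    regions = []
    for y in range(ny):
        for x in range(nx):
            if field[y][x] > IWV_THRESHOLD and not seen[y][x]:
                comp, queue = [], deque([(y, x)])
                seen[y][x] = True
                while queue:
                    cy, cx = queue.popleft()
                    comp.append((cy, cx))
                    for py, px in ((cy + 1, cx), (cy - 1, cx),
                                   (cy, cx + 1), (cy, cx - 1)):
                        if (0 <= py < ny and 0 <= px < nx
                                and field[py][px] > IWV_THRESHOLD
                                and not seen[py][px]):
                            seen[py][px] = True
                            queue.append((py, px))
                regions.append(comp)
    return regions

field = [[10, 25, 26, 10, 10],
         [10, 10, 27, 28, 10],
         [22, 10, 10, 29, 30],
         [10, 10, 10, 10, 10]]
regions = detect_regions(field)
print([len(r) for r in regions])  # → [6, 1]: one river-like band, one speck
```

A subsequent geometric test on each region (length, orientation) would separate elongated river-like features from small isolated patches.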

### Mehmet Balman, Surendra Byna,"Open Problems in network-aware data management in exa-scale computing and terabit networking era",In Proceedings of the First International Workshop on Network-Aware Data Management, in conjunction with ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, 2011,Seattle, WA,November 11, 2011,LBNL 6176E, doi: 10.1145/2110217.2110229

Accessing and managing large amounts of data is a great challenge in collaborative computing environments where resources and users are geographically distributed. Recent advances in network technology have led to next-generation high-performance networks, allowing high-bandwidth connectivity. Efficient use of the network infrastructure is necessary in order to address the increasing data and compute requirements of large-scale applications. We discuss several open problems, evaluate emerging trends, and articulate our perspectives on network-aware data management.

### Maciej Haranczyk, Chris H. Rycroft & James A. Sethian,"Empty Space and New Materials: Computational Tools for Porous Materials",SIAM News,October 18, 2011,

Crystalline porous materials are some of the most important synthetic products ever made...