Publications

Over the years, researchers in the Computer Science Department have published their research in a variety of journals and conference proceedings. Below is a sampling of our recent work.

2017

Hongzhang Shan, Samuel Williams, Calvin Johnson, Kenneth McElvain, "A Locality-based Threading Algorithm for the Configuration-Interaction Method", Parallel and Distributed Scientific and Engineering Computing (PDSEC), June 2017,

Bryce Adelstein Lelbach, Hans Johansen, Samuel Williams, "Solving Large Quantities of Small Matrix Problems on Cache-Coherent Many-Core SIMD Architectures", Parallel and Distributed Scientific and Engineering Computing (PDSEC), June 2017,

Ariful Azad, Aydin Buluc, "Towards a GraphBLAS Library in Chapel", IPDPS Workshops, Orlando, FL, May 2017,

Aydin Buluc, Tim Mattson, Scott McMillan, Jose Moreira, Carl Yang, "Design of the GraphBLAS API for C", IEEE Workshop on Graph Algorithm Building Blocks, IPDPSW, 2017,

Ariful Azad, Aydin Buluc, "A work-efficient parallel sparse matrix-sparse vector multiplication algorithm", IEEE International Parallel & Distributed Processing Symposium (IPDPS), Orlando, FL, May 2017,

Ariful Azad, Mathias Jacquelin, Aydin Buluc, Esmond G. Ng, "The Reverse Cuthill-McKee Algorithm in Distributed-Memory", IEEE International Parallel & Distributed Processing Symposium (IPDPS), Orlando, FL, May 2017,

Nathan Zhang, Michael Driscoll, Armando Fox, Charles Markley, Samuel Williams, Protonu Basu, "Snowflake: A Lightweight Portable Stencil DSL", High-level Parallel Programming Models and Supportive Environments (HIPS), May 2017,

Bei Wang, Stephane Ethier, William Tang, Khaled Ibrahim, Kamesh Madduri, Samuel Williams, Leonid Oliker, "Modern Gyrokinetic Particle-in-cell Simulation of Fusion Plasmas on Top Supercomputers", (to appear) International Journal of High-Performance Computing Applications (IJHPCA), May 2017,

Protonu Basu, Samuel Williams, Brian Van Straalen, Leonid Oliker, Phillip Colella, Mary Hall, "Compiler-Based Code Generation and Autotuning for Geometric Multigrid on GPU-Accelerated Supercomputers", Parallel Computing (PARCO), April 2017, doi: 10.1016/j.parco.2017.04.002

Khaled Z. Ibrahim, Evgeny Epifanovsky, Samuel Williams, Anna I. Krylov, "Cross-scale efficient tensor contractions for coupled cluster computations through multiple programming model backends", Journal of Parallel and Distributed Computing (JPDC), February 2017, doi: 10.1016/j.jpdc.2017.02.010

Tan Nguyen, Pietro Cicotti, Eric Bylaska, Dan Quinlan, and Scott Baden, "Automatic Translation of MPI Source into a Latency-tolerant, Data-driven Form", Journal of Parallel and Distributed Computing, February 21, 2017,

Esmond Ng, Katherine J. Evans, Peter Caldwell, Forrest M. Hoffman, Charles Jackson, Kerstin Van Dam, Ruby Leung, Daniel F. Martin, George Ostrouchov, Raymond Tuminaro, Paul Ullrich, Stefan Wild, Samuel Williams, "Advances in Cross-Cutting Ideas for Computational Climate Science (AXICCS)", January 2017, doi: 10.2172/1341564

2016

S.V. Venkatakrishnan, Jeffrey Donatelli, Dinesh Kumar, Abhinav Sarje, Sunil K. Sinha, Xiaoye S. Li, Alexander Hexemer, "A Multi-slice Simulation Algorithm for Grazing-Incidence Small-Angle X-ray Scattering", Journal of Applied Crystallography, December 2016, 49-6, doi: 10.1107/S1600576716013273

Grazing-incidence small-angle X-ray scattering (GISAXS) is an important technique in the characterization of samples at the nanometre scale. A key aspect of GISAXS data analysis is the accurate simulation of samples to match the measurement. The distorted-wave Born approximation (DWBA) is a widely used model for the simulation of GISAXS patterns. For certain classes of sample such as nanostructures embedded in thin films, where the electric field intensity variation is significant relative to the size of the structures, a multi-slice DWBA theory is more accurate than the conventional DWBA method. However, simulating complex structures in the multi-slice setting is challenging and the algorithms typically used are designed on a case-by-case basis depending on the structure to be simulated. In this paper, an accurate algorithm for GISAXS simulations based on the multi-slice DWBA theory is presented. In particular, fundamental properties of the Fourier transform have been utilized to develop an algorithm that accurately computes the average refractive index profile as a function of depth and the Fourier transform of the portion of the sample within a given slice, which are key quantities required for the multi-slice DWBA simulation. The results from this method are compared with the traditionally used approximations, demonstrating that the proposed algorithm can produce more accurate results. Furthermore, this algorithm is general with respect to the sample structure, and does not require any sample-specific approximations to perform the simulations.
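
As an illustration of the slicing step discussed above, the sketch below computes a depth-dependent average refractive-index profile from a voxelized sample. It is a deliberately simplified stand-in, not the Fourier-transform-based algorithm of the paper; the array layout, function name, and example values are assumptions.

```python
import numpy as np

def average_index_profile(delta, slice_thickness, dz):
    """Illustrative sketch (not the paper's algorithm): average the
    refractive-index decrement delta(x, y, z) over each lateral plane,
    then bin planes into slices of the requested thickness.

    delta : 3-D array indexed as [z, y, x] on a uniform grid
    slice_thickness : slice height in the same units as dz
    dz : grid spacing along the depth (z) direction
    """
    per_plane = delta.mean(axis=(1, 2))           # lateral average at each depth
    planes_per_slice = max(1, int(round(slice_thickness / dz)))
    n_slices = int(np.ceil(per_plane.size / planes_per_slice))
    profile = np.empty(n_slices)
    for s in range(n_slices):
        chunk = per_plane[s * planes_per_slice:(s + 1) * planes_per_slice]
        profile[s] = chunk.mean()                  # average index within the slice
    return profile

# Example: a film with an embedded higher-index layer between z = 20 and z = 40
delta = np.zeros((100, 64, 64))
delta[20:40] = 1e-5
print(average_index_profile(delta, slice_thickness=10.0, dz=1.0))
```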

T. Nguyen, D. Unat, W. Zhang, A. Almgren, N. Farooqi, and J. Shalf, "Perilla: Metadata-based Optimizations of an Asynchronous Runtime for Adaptive Mesh Refinement", Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC16), November 17, 2016,

Mark Adams, Samuel Williams, HPGMG BoF - Introduction, HPGMG BoF, Supercomputing, November 2016,

Samuel Williams, HPGMG on the Knights Landing Processor, HPGMG BoF, Supercomputing, November 2016,

Samuel Williams, HPGMG Benchmark, Top500 BoF, Supercomputing, November 2016,

Taylor Barnes, Brandon Cook, Jack Deslippe, Douglas Doerfler, Brian Friesen, Yun (Helen) He, Thorsten Kurth, Tuomas Koskela, Mathieu Lobet, Tareq Malas, Leonid Oliker, Andrey Ovsyannikov, Abhinav Sarje, Jean-Luc Vay, Henri Vincenti, Samuel Williams, Pierre Carrier, Nathan Wichmann, Marcus Wagner, Paul Kent, Christopher Kerr, John Dennis, "Evaluating and Optimizing the NERSC Workload on Knights Landing", Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), November 2016,

Hongzhang Shan, Samuel Williams, Yili Zheng, Weiqun Zhang, Bei Wang, Stephane Ethier, Zhengji Zhao, "Experiences of Applying One-Sided Communication to Nearest-Neighbor Communication", PGAS Applications Workshop (PAW), November 2016,

William Tang, Bei Wang, Stephane Ethier, Grzegorz Kwasniewski, Torsten Hoefler, Khaled Z. Ibrahim, Kamesh Madduri, Samuel Williams, Leonid Oliker, Carlos Rosales-Fernandez, Tim Williams, "Extreme Scale Plasma Turbulence Simulations on Top Supercomputers Worldwide", Supercomputing, November 2016,

Ariful Azad, Grey Ballard, Aydin Buluc, James Demmel, Laura Grigori, Oded Schwartz, Sivan Toledo, Samuel Williams, "Exploiting multiple levels of parallelism in sparse matrix-matrix multiplication", SIAM Journal on Scientific Computing, 38(6), C624–C651, November 2016, doi: 10.1137/15M104253X

Nicholas Chaimov, Khaled Z. Ibrahim, Samuel Williams, Costin Iancu, "Reaching Bandwidth Saturation Using Transparent Injection Parallelization", International Journal of High Performance Computing Applications (IJHPCA), November 2016, doi: 10.1177/1094342016672720

Hasan Metin Aktulga, Md. Afibuzzaman, Samuel Williams, Aydın Buluc, Meiyue Shao, Chao Yang, Esmond G. Ng, Pieter Maris, James P. Vary, "A High Performance Block Eigensolver for Nuclear Configuration Interaction Calculations", IEEE Transactions on Parallel and Distributed Systems (TPDS), November 2016, doi: 10.1109/TPDS.2016.2630699

Zhaoyi Meng, Alice Koniges, Yun (Helen) He, Samuel Williams, Thorsten Kurth, Brandon Cook, Jack Deslippe, and Andrea L. Bertozzi, "OpenMP Parallelization and Optimization of Graph-Based Machine Learning Algorithms", 12th International Workshop on OpenMP (iWOMP), October 2016, doi: 10.1007/978-3-319-45550-1_2

Pieter Ghysels, Xiaoye S. Li, François-Henry Rouet, Samuel Williams, Artem Napov, "An Efficient Multicore Implementation of a Novel HSS-Structured Multifrontal Solver Using Randomized Sampling", SIAM J. Sci. Comput. 38-5, pp. S358-S384, October 2016, doi: 10.1137/15M1010117

Jeremy Kepner, Peter Aaltonen, David Bader, Aydin Buluç, Franz Franchetti, John Gilbert, Dylan Hutchison, Manoj Kumar, Andrew Lumsdaine, Henning Meyerhenke, Scott McMillan, José Moreira, John Owens, Carl Yang, Marcin Zalewski, Timothy Mattson, "Mathematical foundations of the GraphBLAS", IEEE High Performance Extreme Computing (HPEC), September 1, 2016,

Abhinav Sarje, Xiaoye S Li, Nicholas Wright, "Achieving High Parallel Efficiency on Modern Processors for X-ray Scattering Data Analysis", International Workshop on Multicore Software Engineering at EuroPar, 2016,

Khaled Z. Ibrahim, Evgeny Epifanovsky, Samuel Williams, Anna I. Krylov, "Cross-scale Efficient Tensor Contractions for Coupled Cluster Computations Through Multiple Programming Model Backends", LBNL. - Report Number: LBNL-1005853, July 1, 2016,

Ariful Azad, Bartek Rajwa, Alex Pothen, "flowVS: Channel-Specific Variance Stabilization in Flow Cytometry", BMC Bioinformatics, June 2016,

Douglas Doerfler, Jack Deslippe, Samuel Williams, Leonid Oliker, Brandon Cook, Thorsten Kurth, Mathieu Lobet, Tareq Malas, Jean-Luc Vay, and Henri Vincenti, "Applying the Roofline Performance Model to the Intel Xeon Phi Knights Landing Processor", Intel Xeon Phi User Group Workshop (IXPUG), June 2016,

Weiqun Zhang, Ann Almgren, Marcus Day, Tan Nguyen, John Shalf, Didem Unat, "BoxLib with Tiling: An AMR Software Framework", SIAM Journal on Scientific Computing, 2016,

Ariful Azad, Aydın Buluç, "A matrix-algebraic formulation of distributed-memory maximal cardinality matching algorithms in bipartite graphs", Parallel Computing, June 2016,

Penporn Koanantakool, Ariful Azad, Aydın Buluç, Dmitriy Morozov, Sang-Yun Oh, Leonid Oliker, Katherine Yelick, "Communication-Avoiding Parallel Sparse-Dense Matrix-Matrix Multiplication", IEEE International Parallel & Distributed Processing Symposium (IPDPS), May 2016,

Ariful Azad, Aydin Buluç, "Distributed-Memory Algorithms for Maximum Cardinality Matching in Bipartite Graphs", IEEE International Parallel & Distributed Processing Symposium (IPDPS), May 2016,

Ariful Azad, Aydın Buluç, Alex Pothen, "Computing Maximum Cardinality Matchings in Parallel on Bipartite Graphs via Tree-Grafting", IEEE Transactions on Parallel and Distributed Systems (TPDS), May 2016,

Abhinav Sarje, Douglas W. Jacobsen, Samuel W. Williams, Todd Ringler, Leonid Oliker, "Exploiting Thread Parallelism for Ocean Modeling on Cray XC Supercomputers", Cray User Group (CUG), London, UK, May 2016,

Abhinav Sarje, Exploiting Thread Parallelism for Ocean Modeling on Cray XC Supercomputers, Cray Users Group (CUG), May 12, 2016,

Ariful Azad, Aydın Buluç, Distributed-memory algorithms for cardinality matching using matrix algebra, SIAM Conference on Parallel Processing for Scientific Computing (PP), Paris, France, April 2016,

J. R. Jones, F.-H. Rouet, K. V. Lawler, E. Vecharynski, K. Z. Ibrahim, S. Williams, B. Abeln, C. Yang, C. W. McCurdy, D. J. Haxton, X. S. Li, T. N. Rescigno, "An efficient basis set representation for calculating electrons in molecules", Journal of Molecular Physics, 2016, doi: 10.1080/00268976.2016.1176262

The method of McCurdy, Baertschy, and Rescigno, J. Phys. B, 37, R137 (2004) is generalized to obtain a straightforward, surprisingly accurate, and scalable numerical representation for calculating the electronic wave functions of molecules. It uses a basis set of product sinc functions arrayed on a Cartesian grid, and yields 1 kcal/mol precision for valence transition energies with a grid resolution of approximately 0.1 bohr. The Coulomb matrix elements are replaced with matrix elements obtained from the kinetic energy operator. A resolution-of-the-identity approximation renders the primitive one- and two-electron matrix elements diagonal; in other words, the Coulomb operator is local with respect to the grid indices. The calculation of contracted two-electron matrix elements among orbitals requires only O(N log(N)) multiplication operations, not O(N^4), where N is the number of basis functions; N = n^3 on cubic grids. The representation not only is numerically expedient, but also produces energies and properties superior to those calculated variationally. Absolute energies, absorption cross sections, transition energies, and ionization potentials are reported for one- (He^+, H_2^+ ), two- (H_2, He), ten- (CH_4) and 56-electron (C_8H_8) systems.
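
As a hedged sketch of where the O(N log(N)) operation count can come from (the notation below is ours, not the paper's derivation): once the kernel is diagonal (local) in the grid index, a contracted two-electron element reduces to a grid-space quadratic form, and applying a translation-invariant kernel to a gridded orbital product is a discrete convolution that FFT techniques evaluate in O(N log N).

```latex
% Illustrative notation only (not taken from the paper).
(pq|rs) \;\approx\; \sum_{i,j} \rho_{pq}(\mathbf{r}_i)\, V_{ij}\, \rho_{rs}(\mathbf{r}_j),
\qquad \rho_{pq}(\mathbf{r}_i) = \phi_p(\mathbf{r}_i)\,\phi_q(\mathbf{r}_i).
% On a uniform Cartesian grid V_{ij} = v(\mathbf{r}_i - \mathbf{r}_j), so the inner sum
% \sum_j V_{ij}\,\rho_{rs}(\mathbf{r}_j) is a discrete convolution, computable with FFTs in
% O(N \log N) operations (N = n^3 grid points) rather than the O(N^4) of a dense
% four-index transformation.
```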

Samuel Williams, Mark Adams, Brian Van Straalen, Performance Portability in Hybrid and Heterogeneous Multigrid Solvers, Copper Mountain, March 2016,

2015

Abhinav Sarje, Particle Swarm Optimization, DUNE Wire-Cell Reconstruction Summit, December 2015,

Samuel Williams, X-TUNE, X-Stack PI Meeting, December 2015,

Evangelos Georganas, Aydın Buluç, Jarrod Chapman, Steven Hofmeyr, Chaitanya Aluru, Rob Egan, Leonid Oliker, Daniel Rokhsar, Katherine Yelick, "HipMer: An Extreme-Scale De Novo Genome Assembler", Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC), November 19, 2015,

Samuel Williams, 4th Order HPGMG-FV Implementation, HPGMG BoF, Supercomputing, November 2015,

Vladimir Marjanovic, HPC Benchmarking, HPGMG BoF, Supercomputing, November 2015,

Hongzhang Shan, Kenneth McElvain, Calvin Johnson, Samuel Williams, W. Erich Ormand, "Parallel Implementation and Performance Optimization of the Configuration-Interaction Method", Supercomputing (SC), November 2015, doi: 10.1145/2807591.2807618

Hongzhang Shan, Samuel Williams, Yili Zheng, Amir Kamil, Katherine Yelick, "Implementing High-Performance Geometric Multigrid Solver With Naturally Grained Messages", 9th International Conference on Partitioned Global Address Space Programming Models (PGAS), September 2015,

Ariful Azad, Aydin Buluc, "Distributed-Memory Algorithms for Maximal Cardinality Matching using Matrix Algebra", IEEE Cluster, Chicago, IL, September 2015,

Alex Druinsky, Pieter Ghysels, Xiaoye S. Li, Osni Marques, Samuel Williams, Andrew Barker, Delyan Kalchev, Panayot Vassilevski, "Comparative Performance Analysis of Coarse Solvers for Algebraic Multigrid on Multicore and Manycore Architectures", International Conference on Parallel Processing and Applied Mathematics (PPAM), September 6, 2015, doi: 10.1007/978-3-319-32149-3_12

Aydin Buluç, Scott Beamer, Kamesh Madduri, Krste Asanović, David Patterson, "Distributed-memory breadth-first search on massive graphs", in D. Bader (editor), Parallel Graph Algorithms, CRC Press/Taylor-Francis, 2015,

Mahantesh Halappanavar, Alex Pothen, Ariful Azad, Fredrik Manne, Johannes Langguth, Arif Khan, "Codesign Lessons Learned from Implementing Graph Matching on Multithreaded Architectures", IEEE Computer, August 2015,

Abhinav Sarje, Parallel Performance Optimizations on Unstructured Mesh-Based Simulations, International Conference on Computational Science, June 2015,

Abhinav Sarje, Sukhyun Song, Douglas Jacobsen, Kevin Huck, Jeffrey Hollingsworth, Allen Malony, Samuel Williams, and Leonid Oliker, "Parallel Performance Optimizations on Unstructured Mesh-Based Simulations", Procedia Computer Science, 1877-0509, June 2015, 51:2016-2025, doi: 10.1016/j.procs.2015.05.466

This paper addresses two key parallelization challenges in the unstructured mesh-based ocean modeling code MPAS-Ocean, which uses a mesh based on Voronoi tessellations: (1) load imbalance across processes, and (2) unstructured data access patterns that inhibit intra- and inter-node performance. Our work analyzes the load imbalance due to naive partitioning of the mesh, and develops methods to generate mesh partitioning with better load balance and reduced communication. Furthermore, we present methods that minimize both inter- and intra-node data movement and maximize data reuse. Our techniques include predictive ordering of data elements for higher cache efficiency, as well as communication reduction approaches. We present detailed performance data when running on thousands of cores using the Cray XC30 supercomputer and show that our optimization strategies can exceed the original performance by over 2×. Additionally, many of these solutions can be broadly applied to a wide variety of unstructured grid-based computations.
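
To make the data-reordering idea concrete, here is a generic, hedged sketch (not the paper's exact scheme): reorder mesh cells with a breadth-first traversal of the cell adjacency graph so that neighboring cells land close together in memory, which tends to improve cache reuse for unstructured accesses.

```python
from collections import deque

def locality_ordering(neighbors, start=0):
    """Minimal, generic sketch of reordering unstructured-mesh cells so that
    adjacent cells end up close together in memory. This is a plain
    breadth-first ordering, not the exact scheme used in the paper.

    neighbors : dict mapping cell id -> list of adjacent cell ids
    Returns a permutation: order[k] = old cell id placed at position k.
    """
    visited, order, queue = {start}, [], deque([start])
    while queue:
        cell = queue.popleft()
        order.append(cell)
        for nb in sorted(neighbors[cell]):
            if nb not in visited:
                visited.add(nb)
                queue.append(nb)
    # Append any cells in components not reached from `start`
    order.extend(c for c in neighbors if c not in visited)
    return order

# Tiny example: a 2x3 grid of cells stored in an arbitrary order
nbrs = {0: [1, 3], 1: [0, 2, 4], 2: [1, 5], 3: [0, 4], 4: [1, 3, 5], 5: [2, 4]}
print(locality_ordering(nbrs))   # [0, 1, 3, 2, 4, 5]
```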

Scott French, Yili Zheng, Barbara Romanowicz, Katherine Yelick, "Parallel Hessian Assembly for Seismic Waveform Inversion Using Global Updates", International Parallel and Distributed Processing Symposium (IPDPS), May 2015,

Evangelos Georganas, Aydın Buluç, Jarrod Chapman, Leonid Oliker, Daniel Rokhsar, Katherine Yelick, "merAligner: A Fully Parallel Sequence Aligner", IEEE International Parallel & Distributed Processing Symposium (IPDPS), 2015,

Protonu Basu, Samuel Williams, Brian Van Straalen, Mary Hall, Leonid Oliker, Phillip Colella, "Compiler-Directed Transformation for Higher-Order Stencils", International Parallel and Distributed Processing Symposium (IPDPS), May 2015,

Ariful Azad, Aydin Buluc, John Gilbert, "Parallel Triangle Counting and Enumeration using Matrix Algebra", Workshop on Graph Algorithms Building Blocks (GABB), in conjunction with IPDPS, IEEE, May 2015,

Ariful Azad, Aydin Buluç, Alex Pothen, "A Parallel Tree Grafting Algorithm for Maximum Cardinality Matching in Bipartite Graphs", International Parallel and Distributed Processing Symposium (IPDPS), May 2015,

Didem Unat, Cy Chan, Weiqun Zhang, Samuel Williams, John Bachan, John Bell, John Shalf, "ExaSAT: An Exascale Co-Design Tool for Performance Modeling", International Journal of High Performance Computing Applications (IJHPCA), May 2015, doi: 10.1177/1094342014568690

Abhinav Sarje, Recovering Structural Information about Nanoparticle Systems, Nvidia GPU Technology Conference, March 19, 2015,

The inverse modeling problem of recovering nanostructures from X-ray scattering data obtained through experiments at light-source synchrotrons is an ideal example of a Big Data and Big Compute application. This session gives an introduction to and overview of this problem and its solutions as they are being developed at Berkeley Lab. X-ray scattering based extraction of structural information from material samples is an important tool applicable to numerous applications such as the design of energy-relevant nano-devices. We exploit the use of parallelism available in clusters of GPUs to gain efficiency in the reconstruction process. To develop a solution, we apply Particle Swarm Optimization (PSO) in a massively parallel fashion, and develop high-performance codes and analyze the performance.

Abhinav Sarje, Xiaoye S. Li, Dinesh Kumar, Alexander Hexemer, "Recovering Nanostructures from X-Ray Scattering Data", Nvidia GPU Technology Conference (GTC), March 2015,

We consider the inverse modeling problem of recovering nanostructures from X-ray scattering data obtained through experiments at synchrotrons. This has been a primary bottleneck problem in such data analysis. X-ray scattering based extraction of structural information from material samples is an important tool for the characterization of macromolecules and nano-particle systems applicable to numerous applications such as design of energy-relevant nano-devices. We exploit massive parallelism available in clusters of graphics processors to gain efficiency in the reconstruction process. To solve this numerical optimization problem, here we show the application of the stochastic algorithms of Particle Swarm Optimization (PSO) in a massively parallel fashion. We develop high-performance codes for various flavors of the PSO class of algorithms and analyze their performance with respect to the application at hand. We also briefly show the use of two other optimization methods as solutions.
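
For readers unfamiliar with PSO, the following is a minimal serial sketch of the algorithm family on a toy objective; the function names, parameters, and toy residual are assumptions for illustration and bear no relation to the massively parallel GPU implementation described above.

```python
import numpy as np

rng = np.random.default_rng(0)

def pso(objective, dim, n_particles=32, iters=200, w=0.7, c1=1.5, c2=1.5):
    """A minimal, generic particle swarm optimizer (serial, CPU), shown only to
    illustrate the class of algorithm; the paper's codes are massively
    parallel and fit real scattering data."""
    x = rng.uniform(-1.0, 1.0, (n_particles, dim))     # particle positions
    v = np.zeros_like(x)                               # particle velocities
    pbest, pbest_f = x.copy(), np.array([objective(p) for p in x])
    g = pbest[pbest_f.argmin()].copy()                 # global best position
    for _ in range(iters):
        r1, r2 = rng.random((2, n_particles, dim))
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)
        x = x + v
        f = np.array([objective(p) for p in x])
        improved = f < pbest_f
        pbest[improved], pbest_f[improved] = x[improved], f[improved]
        g = pbest[pbest_f.argmin()].copy()
    return g, pbest_f.min()

# Toy "fit": recover hypothetical structural parameters minimizing a residual
target = np.array([0.3, -0.7, 0.1])
best, err = pso(lambda p: np.sum((p - target) ** 2), dim=3)
print(best, err)
```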

M. Chabbi, W. Lavrijsen, W.A. de Jong, K. Sen, J. Mellor-Crummey, C. Iancu, "Barrier elision for production parallel programs", PPoPP ’15: Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, New York, NY, USA, ACM, February 7, 2015, 109-119, doi: 10.1145/2688500.2688502

Hongzhang Shan, Samuel Williams, Wibe de Jong, Leonid Oliker, "Thread-Level Parallelization and Optimization of NWChem for the Intel MIC Architecture", Programming Models and Applications for Multicores and Manycores (PMAM), February 2015,

Costin Iancu, Nicholas Chaimov, Khaled Z. Ibrahim, Samuel Williams, "Exploiting Communication Concurrency on High Performance Computing Systems", Programming Models and Applications for Multicores and Manycores (PMAM), February 2015,

Aydin Buluç, Henning Meyerhenke, Ilya Safro, Peter Sanders, Christian Schulz, "Recent advances in graph partitioning", ArXiv, 2015,

Thorsten Kurth, Andrew Pochinsky, Abhinav Sarje, Sergey Syritsyn, Andre Walker-Loud, "High-Performance I/O: HDF5 for Lattice QCD", arXiv:1501.06992, January 2015,

Practitioners of lattice QCD/QFT have been some of the primary pioneer users of the state-of-the-art high-performance-computing systems, and contribute towards the stress tests of such new machines as soon as they become available. As with all aspects of high-performance-computing, I/O is becoming an increasingly specialized component of these systems. In order to take advantage of the latest available high-performance I/O infrastructure, to ensure reliability and backwards compatibility of data files, and to help unify the data structures used in lattice codes, we have incorporated parallel HDF5 I/O into the SciDAC supported USQCD software stack. Here we present the design and implementation of this I/O framework. Our HDF5 implementation outperforms optimized QIO at the 10-20% level and leaves room for further improvement by utilizing appropriate dataset chunking.
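
As a rough illustration of chunked HDF5 I/O, the sketch below uses serial h5py; it omits the parallel-HDF5 (MPI) driver, and every file name, dataset path, shape, and chunk size is an assumption for illustration, not part of the USQCD implementation.

```python
import numpy as np
import h5py

# Hypothetical lattice volume and dataset names, chosen only for illustration.
lattice = (np.random.rand(8, 8, 8, 16, 3, 3)
           + 1j * np.random.rand(8, 8, 8, 16, 3, 3))

with h5py.File("example_config.h5", "w") as f:
    dset = f.create_dataset(
        "gauge/config0",
        data=lattice.astype(np.complex128),
        chunks=(4, 4, 4, 8, 3, 3),      # chunking controls the unit of I/O
        compression=None,               # lattice data is typically incompressible
    )
    dset.attrs["description"] = "toy gauge field, not a real ensemble"

with h5py.File("example_config.h5", "r") as f:
    block = f["gauge/config0"][0:4, 0:4, 0:4, :, :, :]   # read one sub-volume
    print(block.shape)
```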

2014

Khaled Z. Ibrahim, Samuel W. Williams, Evgeny Epifanovsky, Anna I. Krylov, "Analysis and Tuning of Libtensor Framework on Multicore Architectures", High Performance Computing Conference (HIPC), December 2014,

Sisi Duan, Hein Meling, Sean Peisert, Haibin Zhang, "BChain: Byzantine Replication with High Throughput and Embedded Reconfiguration", Proceedings of the 18th International Conference on Principles of Distributed Systems (OPODIS), Cortina, Italy, Springer, December 2014, 91-106, doi: 10.1007/978-3-319-14472-6_7

Samuel Williams, HPGMG-FV, FastForward2 Proxy App Presentation, December 2014,

Mark Adams, Samuel Williams, Jed Brown, HPGMG, Birds of a Feather (BoF), Supercomputing, November 2014,

Evangelos Georganas, Aydin Buluç, Jarrod Chapman, Leonid Oliker, Daniel Rokhsar, Katherine Yelick, "Parallel de bruijn graph construction and traversal for de novo genome assembly", Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC'14), November 2014,

J.A. Ang, R.F. Barrett, R.E. Benner, D. Burke, C. Chan, D. Donofrio, S.D. Hammond, K.S. Hemmert, S.M. Kelly, H. Le, V.J. Leung, D.R. Resnick, A.F. Rodrigues, J. Shalf, D. Stark, D. Unat, N.J. Wright, "Abstract Machine Models and Proxy Architectures for Exascale Computing", Co-HPC 2014 (to appear), New Orleans, LA, USA, IEEE Computer Society, November 17, 2014,

To achieve Exascale computing, fundamental hardware architectures must change. The most significant consequence of this assertion is the impact on the scientific applications that run on current High Performance Computing (HPC) systems, many of which codify years of scientific domain knowledge and refinements for contemporary computer systems. In order to adapt to Exascale architectures, developers must be able to reason about new hardware and determine what programming models and algorithms will provide the best blend of performance and energy efficiency into the future. While many details of the Exascale architectures are undefined, an abstract machine model is designed to allow application developers to focus on the aspects of the machine that are important or relevant to performance and code structure. These models are intended as communication aids between application developers and hardware architects during the co-design process. We use the term proxy architecture to describe a parameterized version of an abstract machine model, with the parameters added to elucidate potential speeds and capacities of key hardware components. These more detailed architectural models are formulated to enable discussion between the developers of analytic models and simulators and computer hardware architects. They allow for application performance analysis and hardware optimization opportunities. In this report our goal is to provide the application development community with a set of models that can help software developers prepare for Exascale; through the use of proxy architectures, we can enable a more concrete exploration of how well application codes map onto future architectures.
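
A minimal sketch of what "a parameterized version of an abstract machine model" can look like in practice; every field name and number below is illustrative, not taken from the report.

```python
from dataclasses import dataclass

@dataclass
class ProxyArchitecture:
    """Toy parameterization in the spirit of a proxy architecture: an abstract
    machine model plus numbers for key speeds and capacities. All names and
    values here are made up for illustration."""
    cores_per_node: int
    flops_per_core: float        # peak FLOP/s per core
    dram_bandwidth: float        # bytes/s per node
    dram_capacity: float         # bytes per node
    network_bandwidth: float     # injection bytes/s per node

    def machine_balance(self) -> float:
        """Peak FLOPs per byte of DRAM traffic (higher = more compute-rich)."""
        return (self.cores_per_node * self.flops_per_core) / self.dram_bandwidth

node = ProxyArchitecture(
    cores_per_node=64,
    flops_per_core=4.0e10,
    dram_bandwidth=4.0e11,
    dram_capacity=1.28e11,
    network_bandwidth=2.5e10,
)
print(f"machine balance: {node.machine_balance():.1f} flops/byte")
```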

Alex Druinsky, Brian Austin, Sherry Li, Osni Marques, Eric Roman, Samuel Williams, "A Roofline Performance Analysis of an Algebraic Multigrid Solver", Supercomputing (SC), November 2014,

Yu Jung Lo, Samuel Williams, Brian Van Straalen, Terry J. Ligocki, Matthew J. Cordery, Leonid Oliker, Mary W. Hall, "Roofline Model Toolkit: A Practical Tool for Architectural and Program Analysis", Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), November 2014, doi: 10.1007/978-3-319-17248-4_7

Veronika Strnadova, Aydın Buluç, Joseph Gonzalez, Stefanie Jegelka, Jarrod Chapman, John Gilbert, Daniel Rokhsar, Leonid Oliker, "Efficient and accurate clustering for large-scale genetic mapping", IEEE International Conference on Bioinformatics and Biomedicine (BIBM'14), November 1, 2014,

Protonu Basu, Samuel Williams, Brian Van Straalen, Leonid Oliker, Mary Hall, "Converting Stencils to Accumulations for Communication-Avoiding Optimization in Geometric Multigrid", Workshop on Stencil Computations (WOSC), October 2014,

Vivek Kumar, Yili Zheng, Vincent Cavé, Zoran Budimlic, Vivek Sarkar, "HabaneroUPC++: a Compiler-free PGAS Library", 8th International Conference on Partitioned Global Address Space Programming Models (PGAS), October 2014,

Hongzhang Shan, Amir Kamil, Samuel Williams, Yili Zheng, Katherine Yelick, "Evaluation of PGAS Communication Paradigms with Geometric Multigrid", 8th International Conference on Partitioned Global Address Space Programming Models (PGAS), October 2014, doi: 10.1145/2676870.2676874

Abhinav Sarje, Xiaoye S Li, Alexander Hexemer, "Tuning HipGISAXS on Multi and Many Core Supercomputers", High Performance Computing Systems. Performance Modeling, Benchmarking and Simulation, Denver, CO, Springer International Publishing, 2014, 8551:217-238, doi: 10.1007/978-3-319-10214-6_11

With the continual development of multi and many-core architectures, there is a constant need for architecture-specific tuning of application-codes in order to realize high computational performance and energy efficiency, closer to the theoretical peaks of these architectures. In this paper, we present optimization and tuning of HipGISAXS, a parallel X-ray scattering simulation code [9], on various massively-parallel state-of-the-art supercomputers based on multi and many-core processors. In particular, we target clusters of general-purpose multi-cores such as Intel Sandy Bridge and AMD Magny Cours, and many-core accelerators like Nvidia Kepler GPUs and Intel Xeon Phi coprocessors. We present both high-level algorithmic and low-level architecture-aware optimization and tuning methodologies on these platforms. We cover a detailed performance study of our codes on single and multiple nodes of several current top-ranking supercomputers. Additionally, we implement autotuning of many of the algorithmic and optimization parameters for dynamic selection of their optimal values to ensure high-performance and high-efficiency.

Hongzhang Shan, Samuel Williams, Wibe de Jong, Leonid Oliker, "Thread-Level Parallelization and Optimization of NWChem for the Intel MIC Architecture", LBNL Technical Report, October 2014, LBNL 6806E,

George Michelogiannakis, John Shalf, Variable-Width Datapath for On-Chip Network Static Power Reduction, 8th International Symposium on Networks-on-Chip, September 2014,

George Michelogiannakis, John Shalf, "Variable-Width Datapath for On-Chip Network Static Power Reduction", 8th International Symposium on Networks-on-Chip (NOCS), September 2014,

Abhinav Sarje, Xiaoye S Li, Alexander Hexemer, "High-Performance Inverse Modeling with Reverse Monte Carlo Simulations", 43rd International Conference on Parallel Processing, Minneapolis, MN, IEEE, September 2014, 201-210, doi: 10.1109/ICPP.2014.29

In the field of nanoparticle material science, X-ray scattering techniques are widely used for characterization of macromolecules and particle systems (ordered, partially-ordered or custom) based on their structural properties at the micro- and nano-scales. Numerous applications utilize these, including design and fabrication of energy-relevant nanodevices such as photovoltaic and energy storage devices. Due to its size, analysis of raw data obtained through present ultra-fast light beamlines and X-ray scattering detectors has been a primary bottleneck in such characterization processes. To address this hurdle, we are developing high-performance parallel algorithms and codes for analysis of X-ray scattering data for several of the scattering methods, such as the Small Angle X-ray Scattering (SAXS), which we talk about in this paper. As an inverse modeling problem, structural fitting of the raw data obtained through SAXS experiments is a method used for extracting meaningful information on the structural properties of materials. Such fitting processes involve a large number of variable parameters and, hence, require a large amount of computational power. In this paper, we focus on this problem and present a high-performance and scalable parallel solution based on the Reverse Monte Carlo simulation algorithm, on highly-parallel systems such as clusters of multicore CPUs and graphics processors. We have implemented and optimized our algorithm on generic multi-core CPUs as well as the Nvidia GPU architectures with C++ and CUDA. We also present detailed performance results and computational analysis of our code.
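
For orientation, the sketch below shows a generic, serial Reverse Monte Carlo loop on a one-dimensional toy profile; the step size, temperature, and toy model are assumptions for illustration, whereas the paper's implementation fits real SAXS data in parallel on multicore CPUs and GPUs.

```python
import numpy as np

rng = np.random.default_rng(1)

def reverse_monte_carlo(measured, simulate, params, steps=5000,
                        step_size=0.05, temp=1e-3):
    """Serial, generic Reverse Monte Carlo loop, shown only to illustrate the
    idea; not the paper's parallel implementation.

    measured : observed intensity profile (1-D array)
    simulate : function params -> simulated profile of the same length
    params   : initial guess for the structural parameters
    """
    params = np.asarray(params, dtype=float)
    chi2 = np.sum((simulate(params) - measured) ** 2)
    for _ in range(steps):
        trial = params + rng.normal(0.0, step_size, params.shape)   # random move
        trial_chi2 = np.sum((simulate(trial) - measured) ** 2)
        # Metropolis criterion: always accept improvements, sometimes accept worse
        if trial_chi2 < chi2 or rng.random() < np.exp((chi2 - trial_chi2) / temp):
            params, chi2 = trial, trial_chi2
    return params, chi2

# Toy problem: recover two parameters of a damped-cosine "profile"
q = np.linspace(0.0, 5.0, 200)
true = np.exp(-0.8 * q) * np.cos(2.0 * q)
model = lambda p: np.exp(-p[0] * q) * np.cos(p[1] * q)
print(reverse_monte_carlo(true, model, [0.5, 1.5]))
```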

Adam Lugowski, Shoaib Kamil, Aydın Buluç, Samuel Williams, Erika Duriakova, Leonid Oliker, Armando Fox, John R. Gilbert,, "Parallel processing of filtered queries in attributed semantic graphs", Journal of Parallel and Distributed Computing (JPDC), September 2014, doi: 10.1016/j.jpdc.2014.08.010

Didem Unat, George Michelogiannakis, John Shalf, The Role of Modeling in Locality Optimizations, Modeling and simulation workshop (MODSIM), August 2014,

George Michelogiannakis, Collective Memory Transfers for Multi-Core Chips, International Conference on Supercomputing (ICS), June 2014,

Amir Kamil, Yili Zheng, Katherine Yelick, "A Local-View Array Library for Partitioned Global Address Space C++ Programs", ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming, June 2014,

Multidimensional arrays are an important data structure in many scientific applications. Unfortunately, built-in support for such arrays is inadequate in C++, particularly in the distributed setting where bulk communication operations are required for good performance. In this paper, we present a multidimensional library for partitioned global address space (PGAS) programs, supporting the one-sided remote access and bulk operations of the PGAS model. The library is based on Titanium arrays, which have proven to provide good productivity and performance. These arrays provide a local view of data, where each rank constructs its own portion of a global data structure, matching the local view of execution common to PGAS programs and providing maximum flexibility in structuring global data. Unlike Titanium, which has its own compiler with array-specific analyses, optimizations, and code generation, we implement multidimensional arrays solely through a C++ library. The main goal of this effort is to provide a library-based implementation that can match the productivity and performance of a compiler-based approach. We implement the array library as an extension to UPC++, a C++ library for PGAS programs, and we extend Titanium arrays with specializations to improve performance. We evaluate the array library by porting four Titanium benchmarks to UPC++, demonstrating that it can achieve up to 25% better performance than Titanium without a significant increase in programmer effort.

George Michelogiannakis, Alexander Williams, Samuel Williams, John Shalf, "Collective Memory Transfers for Multi-Core Chips", International Conference on Supercomputing (ICS), June 2014, doi: 10.1145/2597652.2597654

Yili Zheng, Amir Kamil, Michael B. Driscoll, Hongzhang Shan, Katherine Yelick, "UPC++: A PGAS Extension for C++", International Parallel and Distributed Processing Symposium (IPDPS), May 2014,

J.A. Ang, R.F. Barrett, R.E. Benner, D. Burke, C. Chan, D. Donofrio, S.D. Hammond, K.S. Hemmert, S.M. Kelly, H. Le, V.J. Leung, D.R. Resnick, A.F. Rodrigues, J. Shalf, D. Stark, D. Unat, N.J. Wright, "Abstract Machine Models and Proxy Architectures for Exascale Computing", May 16, 2014,

H. M. Aktulga, A. Buluc, S. Williams, C. Yang, "Optimizing Sparse Matrix-Multiple Vector Multiplication for Nuclear Configuration Interaction Calculations", International Parallel and Distributed Processing Symposium (IPDPS 2014), May 2014, doi: 10.1109/IPDPS.2014.125

Mark F. Adams, Jed Brown, John Shalf, Brian Van Straalen, Erich Strohmaier, Samuel Williams, "HPGMG 1.0: A Benchmark for Ranking High Performance Computing Systems", LBNL Technical Report, 2014, LBNL 6630E,

Samuel Williams, Mike Lijewski, Ann Almgren, Brian Van Straalen, Erin Carson, Nicholas Knight, James Demmel, "s-step Krylov subspace methods as bottom solvers for geometric multigrid", Parallel and Distributed Processing Symposium, 2014 IEEE 28th International, January 2014, 1149--1158, doi: 10.1109/IPDPS.2014.119

2013

Protonu Basu, Anand Venkat, Mary Hall, Samuel Williams, Brian Van Straalen, Leonid Oliker, "Compiler generation and autotuning of communication-avoiding operators for geometric multigrid", 20th International Conference on High Performance Computing (HiPC), December 2013, 452--461,

George Michelogiannakis, Channel Reservation Protocol for Over-Subscribed Channels and Destinations, Conference on High Performance Computing Networking, Storage and Analysis, 2013,

George Michelogiannakis, Nan Jiang, Daniel U. Becker, William J. Dally, "Channel Reservation Protocol for Over-Subscribed Channels and Destinations", Conference on High Performance Computing Networking, Storage and Analysis, ACM, 2013,

Didem Unat, Cy Chan, Weiqun Zhang, John Bell and John Shalf, Tiling as a Durable Abstraction for Parallelism and Data Locality, Workshop on Domain-Specific Languages and High-Level Frameworks for High Performance Computing, November 18, 2013,

Hongzhang Shan, Brian Austin, Wibe de Jong, Leonid Oliker, Nick Wright, Edoardo Apra, "Performance Tuning of Fock Matrix and Two Electron Integral Calculations for NWChem on Leading HPC Platforms", Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), November 2013, doi: 10.1007/978-3-319-10214-6_13

Bei Wang, Stephane Ethier, William Tang, Timothy Williams, Khaled Z. Ibrahim, Kamesh Madduri, Samuel Williams, Leonid Oliker, "Kinetic Turbulence Simulations at Extreme Scale on Leadership-Class Systems", Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC), November 2013, doi: 10.1145/2503210.2503258

Slim T. Chourou, Abhinav Sarje, Xiaoye Li, Elaine Chan and Alexander Hexemer, "HipGISAXS: a high-performance computing code for simulating grazing-incidence X-ray scattering data", Journal of Applied Crystallography, 2013, 46:1781-1795, doi: 10.1107/S0021889813025843

We have implemented a flexible Grazing Incidence Small-Angle Scattering (GISAXS) simulation code in the framework of the Distorted Wave Born Approximation (DWBA) that effectively utilizes the parallel processing power provided by graphics processors and multicore processors. This constitutes a handy tool for experimentalists facing a massive flux of data, allowing them to accurately simulate the GISAXS process and analyze the produced data. The software computes the diffraction image for any given superposition of custom shapes or morphologies in a user-defined region of the reciprocal space for all possible grazing incidence angles and sample orientations. This flexibility then allows one to easily tackle a wide range of possible sample structures, such as nanoparticles on top of or embedded in a substrate or a multilayered structure. In cases where the sample displays regions of significant refractive index contrast, an algorithm has been implemented to perform a slicing of the sample and compute the averaged refractive index profile to be used as the reference geometry of the unperturbed system. Preliminary tests show good agreement with experimental data for a variety of commonly encountered nanostructures.

George Michelogiannakis, Hardware Support for Collective Memory Transfers in Stencil Computations, Workshop on Optimizing Stencil Computations, October 2013,

George Michelogiannakis, Extending Summation Precision for Distributed Network Operations, 25th International Symposium on Computer Architecture and High Performance Computing, October 2013,

George Michelogiannakis, Xiaoye S. Li, David H. Bailey, John Shalf, "Extending Summation Precision for Network Reduction Operations", 25th International Symposium on Computer Architecture and High Performance Computing, IEEE Computer Society, October 2013,

Double precision summation is at the core of numerous important algorithms such as Newton-Krylov methods and other operations involving inner products, but the effectiveness of summation is limited by the accumulation of rounding errors, which are an increasing problem with the scaling of modern HPC systems and data sets. To reduce the impact of precision loss, researchers have proposed increased- and arbitrary-precision libraries that provide reproducible error or even bounded error accumulation for large sums, but do not guarantee an exact result. Such libraries can also increase computation time significantly. We propose big integer (BigInt) expansions of double precision variables that enable arbitrarily large summations without error and provide exact and reproducible results. This is feasible with performance comparable to that of double-precision floating point summation, by the inclusion of simple and inexpensive logic into modern NICs to accelerate performance on large-scale systems. 
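
A minimal software sketch of the BigInt idea (assuming IEEE-754 doubles): every double is an integer multiple of 2^-1074, so a sum can be accumulated exactly in a wide fixed-point integer and rounded once at the end. This only illustrates the fixed-point expansion using Python's arbitrary-precision integers; it is not the proposed NIC hardware.

```python
# Every IEEE-754 double is an exact integer multiple of 2^-1074, so scaling by
# 2^1074 turns each value into an exact (big) integer.
SCALE_BITS = 1074                      # smallest double exponent (subnormals)

def to_fixed(x: float) -> int:
    num, den = x.as_integer_ratio()    # den is always a power of two
    return num * ((1 << SCALE_BITS) // den)

def exact_sum(values) -> float:
    total = sum(to_fixed(v) for v in values)   # exact big-integer accumulation
    return total / (1 << SCALE_BITS)           # single rounding at the very end

vals = [1e16, 1.0, -1e16, 1.0] * 1000
print(sum(vals))        # naive float summation loses nearly all small terms: 1.0
print(exact_sum(vals))  # 2000.0, exact up to the one final rounding
```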

A. Buluç, K. Madduri, "Graph partitioning for scalable distributed graph computations", AMS Contemporary Mathematics, Graph Partitioning and Graph Clustering (Proc. 10th DIMACS Implementation Challenge), 2013,

Samuel Williams, At Exascale, Will Bandwidth Be Free?, DOE ModSim Workshop, 2013,

Tim Mattson, David Bader, Jon Berry, Aydin Buluc, Jack Dongarra, Christos Faloutsos, John Feo, John Gilbert, Joseph Gonzalez, Bruce Hendrickson, Jeremy Kepner, Charles Leiserson, Andrew Lumsdaine, David Padua, Stephen Poole, Steve Reinhardt, Mike Stonebraker, Steve Wallach, Andrew Yoo, "Standards for Graph Algorithm Primitives", HPEC, 2013,

James Demmel, Samuel Williams, Katherine Yelick, "Automatic Performance Tuning (Autotuning)", The Berkeley Par Lab: Progress in the Parallel Computing Landscape, edited by David Patterson, Dennis Gannon, Michael Wrinn, (Microsoft Research: August 2013) Pages: 337-376

Khaled Z Ibrahim, Kamesh Madduri, Samuel Williams, Bei Wang, Stephane Ethier, Leonid Oliker, "Analysis and optimization of gyrokinetic toroidal simulations on homogenous and heterogenous platforms", International Journal of High Performance Computing Applications (IJHPCA), July 2013, doi: 10.1177/1094342013492446

P. Basu, A. Venkat, M. Hall, S. Williams, B. Van Straalen, L. Oliker, "Compiler Generation and Autotuning of Communication-Avoiding Operators for Geometric Multigrid", Workshop on Stencil Computations (WOSC), 2013,

Cy Chan, Didem Unat, Michael Lijewski, Weiqun Zhang, John Bell, John Shalf, "Software Design Space Exploration for Exascale Combustion Co-Design", International Supercomputing Conference (ISC), Leipzig, Germany, June 16, 2013,

Christopher D. Krieger, Michelle Mills Strout, Catherine Olschanowsky, Andrew Stone, Stephen Guzik, Xinfeng Gao, Carlo Bertolli, Paul H.J. Kelly, Gihan Mudalige, Brian Van Straalen, Sam Williams, "Loop chaining: A programming abstraction for balancing locality and parallelism", Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), 2013 IEEE 27th International, May 2013, 375--384, doi: 10.1109/IPDPSW.2013.68

Aydın Buluç, Erika Duriakova, Armando Fox, John Gilbert, Shoaib Kamil, Adam Lugowski, Leonid Oliker, Samuel Williams, "High-Productivity and High-Performance Analysis of Filtered Semantic Graphs", International Parallel and Distributed Processing Symposium (IPDPS), 2013, doi: 10.1145/2370816.2370897

E. Solomonik, A. Buluç, J. Demmel, "Minimizing communication in all-pairs shortest paths", International Parallel and Distributed Processing Symposium (IPDPS), 2013,

Nan Jiang, Daniel U. Becker, George Michelogiannakis, James Balfour, Brian Towles, John Kim, William J. Dally, "A Detailed and Flexible Cycle-Accurate Network-on-Chip Simulator", International Symposium on Performance Analysis of Systems and Software, IEEE Computer Society, April 2013,

Abhinav Sarje, Samuel Williams, David H. Bailey, "MPQC: Performance analysis and optimization", LBNL Technical Report, February 2013, LBNL 6076E,

George Michelogiannakis, William J. Dally, "Elastic Buffer Flow Control for On-Chip Networks", Transactions on Computers, 2013,

Networks-on-chip (NoCs) were developed to meet the communication requirements of large-scale systems. The majority of current NoCs spend considerable area and power for router buffers. In our past work, we have developed elastic buffer (EB) flow control which adds simple control logic in the channels to use pipeline flip-flops (FFs) as EBs with two storage locations. This way, channels act as distributed FIFOs and input buffers are no longer required. Removing buffers and virtual channels (VCs) significantly simplifies router design. Compared to VC networks, EB networks provide up to a 45% shorter cycle time, 16% more throughput per unit power, or 22% more throughput per unit area. EB networks provide traffic classes using duplicate physical subnetworks. However, this approach negates the cost gains or becomes infeasible for a large number of traffic classes. Therefore, in this paper we propose a hybrid EB-VC router which provides an arbitrary number of traffic classes by using an input buffer to drain flits facing severe contention or deadlock. Thus, hybrid routers operate as EB routers in the common case, and as VC routers when necessary. As a result, the hybrid EB-VC scheme offers 21% more throughput per unit power than VC networks and 12% more than EB networks.

2012

Samuel Williams, Dhiraj D. Kalamkar, Amik Singh, Anand M. Deshpande, Brian Van Straalen, Mikhail Smelyanskiy, Ann Almgren, Pradeep Dubey, John Shalf, Leonid Oliker, "Implementation and Optimization of miniGMG - a Compact Geometric Multigrid Benchmark", December 2012, LBNL 6676E,

Samuel Williams, Optimization of Geometric Multigrid for Emerging Multi- and Manycore Processors, Supercomputing (SC), November 2012,

Abhinav Sarje, Xiaoye S. Li, Slim Chourou, Elaine R. Chan, Alexander Hexemer, "Massively Parallel X-ray Scattering Simulations", Supercomputing, November 2012,

Although present X-ray scattering techniques can provide tremendous information on the nano-structural properties of materials that are valuable in the design and fabrication of energy-relevant nano-devices, a primary challenge remains in the analyses of such data. In this paper we describe a high-performance, flexible, and scalable Grazing Incidence Small Angle X-ray Scattering simulation algorithm and codes that we have developed on multi-core/CPU and many-core/GPU clusters. We discuss in detail our implementation, optimization and performance on these platforms. Our results show speedups of ~125x on a Fermi-GPU and ~20x on a Cray-XE6 24-core node, compared to a sequential CPU code, with near linear scaling on multi-node clusters. To our knowledge, this is the first GISAXS simulation code that is flexible to compute scattered light intensities in all spatial directions allowing full reconstruction of GISAXS patterns for any complex structures and with high-resolutions while reducing simulation times from months to minutes.

Michael Garland, Manjunath Kudlur, Yili Zheng, "Designing a Unified Programming Model for Heterogeneous Machines", Supercomputing (SC), November 2012,

Evangelos Georganas, Jorge González-Domínguez, Edgar Solomonik, Yili Zheng, Juan Touriño, Katherine Yelick, "Communication Avoiding and Overlapping for Numerical Linear Algebra", Supercomputing (SC), November 2012,

S. Williams, D. Kalamkar, A. Singh, A. Deshpande, B. Van Straalen, M. Smelyanskiy, A. Almgren, P. Dubey, J. Shalf, L. Oliker, "Optimization of Geometric Multigrid for Emerging Multi- and Manycore Processors", Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC), November 2012, doi: 10.1109/SC.2012.85

B. Wang, S. Ethier, W. Tang, K. Ibrahim, K. Madduri, S. Williams, "Advances in gyrokinetic particle in cell simulation for fusion plasmas to Extreme scale", Supercomputing (SC), 2012,

Hongzhang Shan, Brian Austin, Nicholas Wright, Erich Strohmaier, John Shalf, Katherine Yelick, "Accelerating Applications at Scale Using One-Sided Communication", Santa Barbara, CA, The 6th Conference on Partitioned Global Address Space Programming Models, October 10, 2012,

A. Buluç, A. Fox, J. R. Gilbert, S. Kamil, A. Lugowski, L. Oliker, S. Williams, "High-performance analysis of filtered semantic graphs", PACT '12 Proceedings of the 21st international conference on Parallel architectures and compilation techniques (extended abstract), 2012, doi: 10.1145/2370816.2370897

Daniel U. Becker, Nan Jiang, George Michelogiannakis, William J. Dally, "Adaptive Backpressure: Efficient Buffer Management for On-Chip Networks", International Conference on Computer Design, IEEE Computer Society, 2012,

This paper introduces Adaptive Backpressure, a novel scheme that improves the utilization of dynamically managed router input buffers by continuously adjusting the stiffness of the flow control feedback loop in response to observed traffic conditions. Through a simple extension to the router’s flow control mechanism, the proposed scheme heuristically limits the number of credits available to individual virtual channels based on estimated downstream congestion, aiming to minimize the amount of buffer space that is occupied unproductively. This leads to more efficient distribution of buffer space and improves isolation between multiple concurrently executing workloads with differing performance characteristics.

Experimental results for a 64-node mesh network show that Adaptive Backpressure improves network stability, leading to an average 2.6× increase in throughput under heavy load across traffic patterns. In the presence of background traffic, the proposed scheme reduces zero-load latency by an average of 31%. Finally, it mitigates the performance degradation encountered when latency- and throughput-optimized execution cores contend for network resources in a heterogeneous chip multi-processor; across a set of PARSEC benchmarks, we observe an average reduction in execution time of 34%.
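
As a loose, software-level illustration of the credit-limiting heuristic described above (the published scheme is router hardware; the thresholds, fields, and congestion estimate here are invented for the example), the cap on credits per virtual channel might look like this:

    /* Editorial sketch of Adaptive Backpressure-style credit limiting.
     * Constants and the congestion estimator are assumptions, not the paper's. */
    #include <stdint.h>

    #define MAX_CREDITS 16                /* dynamically shared downstream buffer slots */

    typedef struct {
        uint32_t downstream_busy;         /* cycles the downstream VC failed to drain   */
        uint32_t window;                  /* length of the observation window in cycles */
        uint32_t credits_in_flight;       /* credits this VC currently holds            */
    } vc_state;

    /* Estimate congestion as the fraction of the window the downstream VC was
     * blocked, and shrink the credit cap as congestion grows so a stalled VC
     * cannot hoard the shared buffer. */
    uint32_t credit_cap(const vc_state *vc)
    {
        uint32_t pct = (100u * vc->downstream_busy) / (vc->window ? vc->window : 1u);
        if (pct > 75) return 2u;                 /* nearly stalled: keep only a trickle */
        if (pct > 25) return MAX_CREDITS / 2u;
        return MAX_CREDITS;                      /* uncongested: full dynamic share     */
    }

    /* A credit is returned upstream for this VC only while it is under its cap. */
    int may_grant_credit(const vc_state *vc)
    {
        return vc->credits_in_flight < credit_cap(vc);
    }

The point of the heuristic is simply that a VC whose downstream buffer drains slowly is prevented from tying up the shared dynamic buffer, which is the unproductive occupancy the paper targets.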

K. Madduri, J. Su, S. Williams, L. Oliker, S. Ethier, K. Yelick, "Optimization of Parallel Particle-to-Grid Interpolation on Leading Multicore Platforms", Transactions on Parallel and Distributed Systems (TPDS), October 1, 2012, doi: 10.1109/TPDS.2012.28

K. Kandalla, A. Buluç, H. Subramoni, K. Tomko, J. Vienne, L. Oliker, D. K. Panda, "Can network-offload based non-blocking neighborhood MPI collectives improve communication overheads of irregular graph algorithms?", International Workshop on Parallel Algorithms and Parallel Software (IWPAPS 2012), 2012,

Jun Zhou, Didem Unat, Dong Ju Choi, Clark C. Guest, Yifeng Cui, "Hands-on Performance Tuning of 3D Finite Difference Earthquake Simulation on GPU Fermi Chipset", Procedia CS, 2012, Vol 9:976-985,

J. Krueger, P. Micikevicius, S. Williams, "Optimization of Forward Wave Modeling on Contemporary HPC Architectures", LBNL Technical Report, 2012, LBNL 5751E,

Paul H. Hargrove, UPC Language Full-day Tutorial, Workshop at UC Berkeley, July 12, 2012,

Hongzhang Shan, Erich Strohmaier, James Amundson, Eric G. Stern, "Optimizing The Advanced Accelerator Simulation Framework Synergia Using OpenMP", IWOMP'12 Proceedings of the 8th International Conference on OpenMP, June 11, 2012,

A. Buluç, J. Gilbert, "Parallel sparse matrix-matrix multiplication and indexing: Implementation and experiments", SIAM Journal on Scientific Computing (SISC), 2012,

Energy-Efficient Flow-Control for On-Chip Networks, George Michelogiannakis, Stanford University, 2012,

With the emergence of on-chip networks, the power consumed by router buffers has become a primary concern. Bufferless flow control has been proposed to address this issue by removing router buffers and handling contention by dropping or deflecting flits. In this thesis, we compare virtual-channel (buffered) and deflection (packet-switched bufferless) flow control. Our study shows that unless process constraints lead to excessively costly buffers, the performance, cost and increased complexity of deflection flow control outweigh its potential gains. To provide buffering in the network but without the cost and timing overhead of router buffers, we propose elastic buffer (EB) flow control which adds simple control logic in the channels to use pipeline flip-flops (FFs) as EBs with two storage locations. This way, channels act as distributed FIFOs and input buffers as well as the complexity for virtual channels (VCs) are no longer required. Therefore, EB networks have a shorter cycle time and offer more throughput per unit power than VC networks. We also propose a hybrid EB-VC router which is used to provide traffic separation for a number of traffic classes large enough for duplicate physical channels to be inefficient. These hybrid routers offer more throughput per unit power than both EB and VC routers. Finally, this thesis proposes packet chaining, which addresses the tradeoff between allocation quality and cycle time traditionally present in routers with VCs. Packet chaining is a simple and effective method to increase allocator matching efficiency to be comparable or superior to more complex and slower allocators without extending cycle time, particularly suited to networks with short packets.

Mads Kristensen, Yili Zheng, Brian Vinter, "PGAS for Distributed Numerical Python Targeting Multi-core Clusters", IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2012,

Didem Unat, Jun Zhou, Yifeng Cui, Scott B. Baden, Xing Cai, "Accelerating a 3D Finite Difference Earthquake Simulation with a C-to-CUDA Translator", Computing in Science and Engineering, May 2012, Vol 14:48-59,

A. Lugowski, D. Alber, A. Buluç, J. Gilbert, S. Reinhardt, Y. Teng, A. Waranis, "A flexible open-source toolbox for scalable complex graph analysis", SIAM Conference on Data Mining (SDM), 2012,

A. Lugowski, A. Buluç, J. Gilbert, S. Reinhardt, "Scalable complex graph analysis with the knowledge discovery toolbox", International Conference on Acoustics, Speech, and Signal Processing (ICASSP), March 2012,

Abhinav Sarje, Next-Generation Scientific Computing with Graphics Processors, Beijing Computational Science Research Center, February 2012,

Mitesh R. Meswani, Laura Carrington, Didem Unat, Allan Snavely, Scott B. Baden, Stephen Poole, "Modeling and Predicting Performance of High Performance Computing Applications on Hardware Accelerators", IPDPS Workshops, IEEE Computer Society, 2012,

Nan Jiang, Daniel U. Becker, George Michelogiannakis, William J. Dally, "Network Congestion Avoidance through Speculative Reservation", International Symposium on High Performance Computer Architecture, IEEE Computer Society, 2012,

Congestion caused by hot-spot traffic can significantly degrade the performance of a computer network. In this study, we present the Speculative Reservation Protocol (SRP), a new network congestion control mechanism that relieves the effect of hot-spot traffic in high bandwidth, low latency, lossless computer networks. Compared to existing congestion control approaches like Explicit Congestion Notification (ECN), which react to network congestion through packet marking and rate throttling, SRP takes a proactive approach of congestion avoidance. Using a light-weight endpoint reservation scheme and speculative packet transmission, SRP avoids hot-spot congestion while incurring minimal overhead. Our simulation results show that SRP responds more rapidly to the onset of severe hot-spots than ECN and has a higher network throughput on bursty network traffic. SRP also performs comparably to networks without congestion control on benign traffic patterns by reducing the latency and throughput overhead commonly associated with reservation protocols.
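
The two-phase control flow can be summarized with a small, self-contained toy in C (everything here, from the fake drop probability to the phase boundary, is an editorial illustration rather than the actual protocol or its packet formats):

    /* Editorial toy of speculative-reservation-style sending; the "network" is
     * faked with a fixed drop probability. */
    #include <stdio.h>
    #include <stdlib.h>

    #define NPKTS 32

    /* Speculative packets are low priority and may be dropped under congestion. */
    int send_speculative(int pkt_id, double drop_prob)
    {
        (void)pkt_id;
        return ((double)rand() / RAND_MAX) >= drop_prob;   /* 1 = delivered */
    }

    int main(void)
    {
        int delivered[NPKTS] = {0};
        int grant_after = 10;   /* pretend the reservation grant arrives after 10 tries */
        int sent = 0, speculative_ok = 0;

        /* Speculative phase: transmit immediately while the reservation is pending. */
        while (sent < grant_after && sent < NPKTS) {
            delivered[sent] = send_speculative(sent, 0.3);
            speculative_ok += delivered[sent];
            sent++;
        }

        /* Reserved phase: everything not yet delivered (never sent, or sent
         * speculatively and dropped, which the receiver would NACK) goes out on
         * the reserved, lossless schedule. */
        int reserved = 0;
        for (int i = 0; i < NPKTS; i++)
            if (!delivered[i]) { delivered[i] = 1; reserved++; }

        printf("speculative deliveries: %d, reserved (re)transmissions: %d\n",
               speculative_ok, reserved);
        return 0;
    }

The essential property is that speculative traffic can be dropped rather than blocking the network, while anything it fails to deliver is covered by the reservation, so benign traffic sees little overhead and hot-spot traffic is throttled to the reserved rate.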

Han Suk Kim, Didem Unat, Scott B. Baden, Jürgen P. Schulze, "Interactive Data-centric Viewpoint Selection", Visualization and Data Analysis, Proc. SPIE 8294, January 2012,

Benjamin Edwards, Steven Hofmeyr, George Stelle, Stephanie Forrest, "Internet topology over time", arXiv preprint arXiv:1202.3993, January 1, 2012,

A. Napov, "Conditioning Analysis of Incomplete Cholesky Factorizations with Orthogonal Dropping", 2012, LBNL 5353E,

A. Napov and Y. Notay, "An Algebraic Multigrid Method with Guaranteed Convergence Rate", SIAM J. Sci. Comput., vol.43, pp. A1079-A1109, 2012,

Benjamin Edwards, Tyler Moore, George Stelle, Steven Hofmeyr, Stephanie Forrest, "Beyond the blacklist: modeling malware spread and the effect of interventions", Proceedings of the 2012 workshop on New security paradigms, January 1, 2012, 53--66,

2011

George Michelogiannakis, Nan Jiang, Daniel U. Becker, William J. Dally, "Packet Chaining: Efficient Single-Cycle Allocation for On-Chip networks", International Symposium on Microarchitecture, ACM, 2011,

This paper introduces packet chaining, a simple and effective method to increase allocator matching efficiency and hence network performance, particularly suited to networks with short packets and short cycle times. Packet chaining operates by chaining packets destined to the same output together, to reuse the switch connection of a departing packet. This allows an allocator to build up an efficient matching over a number of cycles like incremental allocation, but not limited by packet length. For a 64-node 2D mesh at maximum injection rate and with single-flit packets, packet chaining increases network throughput by 15% compared to a highly-tuned router using a conventional single-iteration separable iSLIP allocator, and outperforms significantly more complex allocators. Specifically, it outperforms multiple-iteration iSLIP allocators and wavefront allocators by 10% and 6% respectively, and gives comparable throughput with an augmenting paths allocator. Packet chaining achieves this performance with a cycle time comparable to a single-iteration separable allocator. Packet chaining also reduces average network latency by 22.5% compared to a single-iteration iSLIP allocator. Finally, packet chaining increases IPC up to 46% (16% average) for application benchmarks because short packets are critical in a typical cache-coherent chip multiprocessor.
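
A toy allocation loop in C conveys the core idea of chaining (the data structures, the fixed-priority arbiter, and the single-request-per-input simplification are all assumptions of this sketch, not the paper's allocator):

    /* Editorial sketch of packet chaining in switch allocation. */
    #include <stdbool.h>

    #define NPORTS 5   /* e.g., a 5-port 2D mesh router */

    typedef struct {
        bool has_packet;      /* a packet is waiting at this input            */
        int  out_port;        /* the output that packet requests              */
        bool last_flit_sent;  /* the packet currently using the switch ended  */
    } input_state;

    typedef struct {
        int owner_input;      /* input holding this output's connection, or -1 */
    } output_state;

    /* One allocation step.  An output whose current packet just finished stays
     * connected ("chained") to the same input if that input's next packet also
     * wants this output; otherwise the connection is released and re-arbitrated.
     * Input-side conflicts and virtual channels are ignored in this toy. */
    void allocate(input_state in[NPORTS], output_state out[NPORTS])
    {
        for (int o = 0; o < NPORTS; o++) {
            int i = out[o].owner_input;
            if (i >= 0 && in[i].last_flit_sent) {
                if (in[i].has_packet && in[i].out_port == o)
                    continue;                /* chain: keep the switch connection     */
                out[o].owner_input = -1;     /* release and fall through to arbitrate */
            }
            if (out[o].owner_input < 0) {    /* simple fixed-priority arbitration     */
                for (int j = 0; j < NPORTS; j++) {
                    if (in[j].has_packet && in[j].out_port == o) {
                        out[o].owner_input = j;
                        break;
                    }
                }
            }
        }
    }

Reusing the departing packet's connection is what lets the matching improve over successive cycles without lengthening the allocator's critical path.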

George Michelogiannakis, Packet Chaining: Efficient Single-Cycle Allocation for On-Chip networks, International Symposium on Microarchitecture, 2011,

Mitesh R. Meswani, Laura Carrington, Didem Unat, Joshua Peraza, Allan Snavely, Scott Baden, Stephen Poole, "Modeling and Predicting Application Performance on Hardware Accelerators", International Symposium on Workload Characterization (IISWC), IEEE, November 2011,

A. Buluç, K. Madduri, "Parallel breadth-first search on distributed memory systems", Supercomputing (SC), November 2011,

S. Williams, et al., Extracting Ultra-Scale Lattice Boltzmann Performance via Hierarchical and Distributed Auto-Tuning, Supercomputing (SC), 2011,

Matt Bishop, Justin Cummins, Sean Peisert, Bhume Bhumitarana, Anhad Singh, Deborah Agarwal, Deborah Frincke, Michael Hogarth, "Relationships and Data Sanitization: A Study in Scarlet", Proceedings of the 2010 New Security Paradigms Workshop (NSPW), Concord, MA, ACM, September 2011, 151-164, doi: 10.1145/1900546.1900567

S. Williams, et al., Stencil Computations on CPUs, Stanford Earth Sciences Algorithms and Architectures Initiative (SESAAI), 2011,

S. Williams, et al., Performance Optimization of HPC Applications on Multi- and Manycore Processors, Workshop on Hybrid Technologies for NASA Applications, 4th International Conference on Space Mission Challenges for Information Technology, 2011,

J. Demmel, K. Yelick, M. Anderson, G. Ballard, E. Carson, I. Dumitriu, L. Grigori, M. Hoemmen, O. Holtz, K. Keutzer, N. Knight, J. Langou, M. Mohiyuddin, O. Schwartz, E. Solomonik, S. Williams, Hua Xiang, Rethinking Algorithms for Future Architectures: Communication-Avoiding Algorithms, Hot Chips 23, 2011,

S. Williams, et al, Stencil Computations on CPUs, Society of Exploration Geophysicists High-Performance Computing Workshop (SEG), July 2011,

Paul H. Hargrove, UPC Language Half-day Tutorial, Workshop at UC Berkeley, June 15, 2011,

S. Hofmeyr, T. Moore, S. Forrest, B. Edwards, G. Stelle, "Modeling Internet scale policies for cleaning up malware", Workshop on the Economics of Information Security (WEIS 2011), June 14, 2011,

Chang-Seo Park, Koushik Sen, Paul Hargrove, Costin Iancu, "Efficient data race detection for distributed memory parallel programs", Supercomputing (SC), 2011,

Didem Unat, Xing Cai, Scott B. Baden, "Mint: realizing CUDA performance in 3D stencil methods with annotated C", ICS '11 Proceedings of the international conference on Supercomputing, ACM, June 2011, 214-224,

Sean Whalen, Sean Peisert, Matt Bishop, "Network-Theoretic Classification of Parallel Computation Patterns", Proceedings of the First International Workshop on Characterizing Applications for Heterogeneous Exascale Systems (CACHES), Tucson, AZ, IEEE Computer Society, June 4, 2011,

Didem Unat, Han Suk Kim, Jurgen Schulze, Scott Baden, Auto-optimization of a feature selection algorithm, Emerging Applications and Many-Core Architecture, June 2011,

A. Buluç, J. Gilbert, "The Combinatorial BLAS: Design, implementation, and applications", International Journal of High-Perormance Computing Applications (IJHPCA), 2011,

A. Buluç, S. Williams, L. Oliker, J. Demmel, "Reduced-Bandwidth Multithreaded Algorithms for Sparse Matrix-Vector Multiplication", International Parallel Distributed Processing Symposium (IPDPS), May 2011, doi: 10.1109/IPDPS.2011.73

P. Narayanan, A. Koniges, L. Oliker, R. Preissl, S. Williams, N. Wright, M. Umansky, X. Xu, S. Ethier, W. Wang, J. Candy, J. Cary, "Performance Characterization for Fusion Co-design Applications", Cray Users Group (CUG), May 2011,

A. Kaiser, S. Williams, K. Madduri, K. Ibrahim, D. Bailey, J. Demmel, E. Strohmaier, "TORCH Computational Reference Kernels: A Testbed for Computer Science Research", LBNL Technical Report, 2011, LBNL 4172E,

A. Napov and Y. Notay, "Smoothing Factor, Order of Prolongation and Actual Multigrid Convergence", Numerische Mathematik , vol.118, pp. 457-483, 2011,

A. Napov and Y. Notay, "Algebraic Analysis of Aggregation-Based Multigrid", Numer. Lin. Alg. Appl., vol.18, pp. 539-564, 2011,

The short version of this paper won the Student Paper Competition at the 11th Copper Mountain Conference on Iterative Methods.

M. Christen, N. Keen, T. Ligocki, L. Oliker, J. Shalf, B. van Straalen, S. Williams, "Automatic Thread-Level Parallelization in the Chombo AMR Library", LBNL Technical Report, 2011, LBNL 5109E,

David H. Bailey, Robert F. Lucas, Samuel W. Williams, ed., Performance Tuning of Scientific Applications, (CRC Press: 2011)

Rajesh Nishtala, Yili Zheng, Paul Hargrove, Katherine A. Yelick, "Tuning collective communication for Partitioned Global Address Space programming models", Parallel Computing, 2011, 37(9):576-591,

David H. Bailey, Lin-Wang Wang, Hongzhang Shan, Zhengji Zhao, Juan Meza, Erich Strohmaier, Byounghak Lee, "Tuning an electronic structure code", Performance Tuning of Scientific Applications, edited by David H. Bailey, Robert F. Lucas, Samuel W. Williams, (CRC Press: 2011) Pages: 339-354

M. Wehner, L. Oliker, J. Shalf, D. Donofrio, L. Drummond, et al., "Hardware/Software Co-design of Global Cloud System Resolving Models", Journal of Advances in Modeling Earth Systems (JAMES), 2011, 3, M1000:22, doi: 10.1029/2011MS000073

Kamesh Madduri, Khaled Ibrahim, Samuel Williams, Eun-Jin Im, Stephane Ethier, John Shalf, Leonid Oliker, "Gyrokinetic toroidal simulations on leading multi- and manycore HPC systems", Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC), January 2011, 23, doi: 10.1145/2063384.2063415

Samuel Williams, Oliker, Carter, John Shalf, "Extracting ultra-scale Lattice Boltzmann performance via hierarchical and distributed auto-tuning", Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC), New York, NY, USA, ACM, January 2011, 55, doi: 10.1145/2063384.2063458

Jens Krueger, David Donofrio, John Shalf, Marghoob Mohiyuddin, Samuel Williams, Leonid Oliker, Franz-Josef Pfreund, "Hardware/software co-design for energy-efficient seismic modeling", Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC), January 2011, 73, doi: 10.1145/2063384.2063482

"Emerging Programming Paradigms for Large-Scale Scientific Computing", Guest editors, Parallel Computing special issue,'Emerging Programming Paradigms for Large-Scale Scientific Computing", 2011,

Kamesh Madduri, Eun-Jin Im, Khaled Z. Ibrahim, Samuel Williams, Stephane Ethier, Leonid Oliker, "Gyrokinetic Particle-in-cell Optimization on Emerging Multi- and Manycore Platforms", Parallel Computing (PARCO), January 2011, 37:501 - 520, doi: 10.1016/j.parco.2011.02.001

R. Sudarsan, J. Borrill, C. Cantalupo, T. Kisner, K. Madduri, L. Oliker, Y. Zheng, H. Simon, "Cosmic microwave background map-making at the petascale and beyond", Proceedings of the International Conference on Supercomputing, 2011, 305-316, doi: 10.1145/1995896.1995944

2010

Khaled Z. Ibrahim, Erich Strohmaier, "Characterizing the Relation Between Apex-Map Synthetic Probes and Reuse Distance Distributions", The 39th International Conference on Parallel Processing (ICPP), 2010, 353 -362,

Samuel W. Williams, David H. Bailey, "Parallel Computer Architecture", Performance Tuning of Scientific Applications, edited by David H. Bailey, Robert F. Lucas, Samuel W. Williams, (CRC Press: 2010) Pages: 11-33

S. Williams, N. Bell, J. W. Choi, M. Garland, L. Oliker, R. Vuduc, "Sparse Matrix-Vector Multiplication on Multicore and Accelerators", chapter in Scientific Computing with Multicore and Accelerators, edited by Jack Dongarra, David A. Bader, Jakub Kurzak, ( 2010)

K. Datta, S. Williams, V. Volkov, J. Carter, L. Oliker, J. Shalf, K. Yelick, "Auto-tuning Stencil Computations on Multicore and Accelerators", Scientific Computing with Multicore and Accelerators, edited by Jack Dongarra, David A. Bader, ( 2010)

S. Williams, "The Roofline Model", chapter in Performance Tuning of Scientific Applications, edited by David H. Bailey, Robert F. Lucas, Samuel W. Williams, (CRC Press: 2010)

Hongzhang Shan, Erich Strohmaier, "Developing a Parameterized Performance Proxy for Sequential Scientific Kernels", 12th IEEE International Conference on High Performance Computing and Communications (HPCC), 2010, September 1, 2010,

G. Hendry, J. Chan, S. Kamil, L. Oliker, J. Shalf, L. Carloni, K. Bergman, "Silicon Nanophotonic Network-On-Chip using TDM Arbitration", Hot Interconnects, August 2010,

S. Williams, et al., Lattice Boltzmann Hybrid Auto-tuning on High-End Computational Platforms, Workshop on Programming Environments for Emerging Parallel Systems (PEEPS), 2010,

A. Kaiser, S. Williams, K. Madduri, K. Ibrahim, D. Bailey, J. Demmel, E. Strohmaier, "A Principled Kernel Testbed for Hardware/Software Co-Design Research", Proceedings of the 2nd USENIX Workshop on Hot Topics in Parallelism (HotPar), 2010,

J. A. Colmenares, S. Bird, H. Cook, P. Pearce, D. Zhu, J. Shalf, S. Hofmeyr, K. Asanovic, J. Kubiatowicz, "Resource Management in the Tessellation Manycore OS", 2nd Usenix Workshop on Hot Topics in Parallelism (HotPar), June 15, 2010,

David Donofrio, Leonid Oliker, John Shalf, Michael F. Wehner, Daniel Burke, John Wawrzynek, "Project Green Flash - Design and Emulate A Low-Power CPU for a New Climate-Modeling Supercomputer", Design Automation Conference (DAC47), 2010,

S. Ethier, M. Adams, J. Carter, L. Oliker, "Petascale Parallelization of the Gyrokinetic Toroidal Code", VECPAR: High Performance Computing for Computational Science, June 2010,

E. Strohmaier, S. Williams, A. Kaiser, K. Madduri, K. Ibrahim, D. Bailey, J. Demmel, "A Kernel Testbed for Parallel Architecture, Language, and Performance Research", International Conference on Numerical Analysis and Applied Mathematics (ICNAAM), June 1, 2010, doi: 10.1063/1.3497950

Sean Peisert, "Fingerprinting Communication and Computation on HPC Machines", Lawrence Berkeley National Laboratory Technical Report, June 2010, LBNL LBNL-3483E,

George Michelogiannakis, Evaluating Bufferless Flow Control for On-chip Networks, International Symposium on Networks-on-Chip, 2010,

George Michelogiannakis, Daniel Sanchez, William J. Dally, Christos Kozyrakis, "Evaluating Bufferless Flow Control for On-chip Networks", International Symposium on Networks-on-Chip, IEEE Computer Society, 2010,

With the emergence of on-chip networks, the power consumed by router buffers has become a primary concern. Bufferless flow control addresses this issue by removing router buffers, and handles contention by dropping or deflecting flits. This work compares virtual-channel (buffered) and deflection (packet-switched bufferless) flow control. Our evaluation includes optimizations for both schemes: buffered networks use custom SRAM-based buffers and empty buffer bypassing for energy efficiency, while bufferless networks feature a novel routing scheme that reduces average latency by 5%. Results show that unless process constraints lead to excessively costly buffers, the performance, cost and increased complexity of deflection flow control outweigh its potential gains: bufferless designs are only marginally (up to 1.5%) more energy efficient at very light loads, and buffered networks provide lower latency and higher throughput per unit power under most conditions.

Daniel Sanchez, George Michelogiannakis, Christos Kozyrakis, "An Analysis of Interconnection Networks for Large Scale Chip Multiprocessors", Transactions on Architecture and Code Optimization, 2010,

With the number of cores of chip multiprocessors (CMPs) rapidly growing as technology scales down, connecting the different components of a CMP in a scalable and efficient way becomes increasingly challenging. In this article, we explore the architectural-level implications of interconnection network design for CMPs with up to 128 fine-grain multithreaded cores. We evaluate and compare different network topologies using accurate simulation of the full chip, including the memory hierarchy and interconnect, and using a diverse set of scientific and engineering workloads.

We find that the interconnect has a large impact on performance, as it is responsible for 60% to 75% of the miss latency. Latency, and not bandwidth, is the primary performance constraint, since, even with many threads per core and workloads with high miss rates, networks with enough bandwidth can be efficiently implemented for the system scales we consider. From the topologies we study, the flattened butterfly consistently outperforms the mesh and fat tree on all workloads, leading to performance advantages of up to 22%. We also show that considering interconnect and memory hierarchy together when designing large-scale CMPs is crucial, and neglecting either of the two can lead to incorrect conclusions. Finally, the effect of the interconnect on overall performance becomes more important as the number of cores increases, making interconnection choices especially critical when scaling up.

Yili Zheng, Filip Blagojevic, Dan Bonachea, Paul H. Hargrove, Steven Hofmeyr, Costin Iancu, Seung-Jai Min, Katherine Yelick, Getting Multicore Performance with UPC, SIAM Conference on Parallel Processing for Scientific Computing, February 2010,

Samuel Williams, Kaushik Datta, Leonid Oliker, Jonathan Carter, John Shalf, Katherine Yelick, "Auto-Tuning Memory-Intensive Kernels for Multicore", Performance Tuning of Scientific Applications, edited by D. H. Bailey, R. F. Lucas, S. W. Williams, (CRC Press: 2010) Pages: 219

A. Napov and Y. Notay, "When Does Two-Grid Optimality Carry Over to the V-Cycle?", Numer. Lin. Alg. Appl., vol.17, pp. 273-290, 2010,

A. Napov and Y. Notay, "Comparison of Bounds for V-Cycle Multigrid", Appl. Numer. Math., vol.60, pp.176-192, 2010,

Algebraic Analysis of V-Cycle Multigrid and Aggregation-Based Two-Grid Methods, A. Napov, 2010,

A. Chandramowlishwaran, S. Williams, L. Oliker, I. Lashuk, G. Biros, R. Vuduc, "Optimizing and Tuning the Fast Multipole Method for State-of-the-Art Multicore Architectures", International Parallel & Distributed Processing Symposium (IPDPS), 2010, doi: 10.1109/IPDPS.2010.5470415

Andrew Uselton, Howison, J. Wright, Skinner, Keen, Shalf, L. Karavanic, Leonid Oliker, "Parallel I/O performance: From events to ensembles", International Parallel & Distributed Processing Symposium (IPDPS), 2010, 1-11,

Shoaib Kamil, Cy Chan, Leonid Oliker, John Shalf, Samuel Williams, "An auto-tuning framework for parallel multicore stencil computations", International Parallel & Distributed Processing Symposium (IPDPS), January 1, 2010, 1-12, doi: 10.1109/IPDPS.2010.5470421

John Shalf, Donofrio, Rowen, Oliker, Michael F. Wehner, "Green Flash: Climate Machine (LBNL)", Encyclopedia of Parallel Computing, (Springer: 2010) Pages: 809-819

Green Flash is a research project focused on an application-driven manycore chip design that leverages commodity-embedded circuit designs and hardware/software codesign processes to create a highly programmable and energy-efficient HPC design. The project demonstrates how a multidisciplinary hardware/software codesign process that facilitates close interactions between applications scientists, computer scientists, and hardware engineers can be used to develop a system tailored for the requirements of scientific computing.

Khaled Z. Ibrahim, "Bridging the Gap Between Complex Software Paradigms and Power-efficient Parallel Architectures", International Conference on Green Computing, 2010, 417-424,

L. Oliker, J. Carter, V. Beckner, J. Bell, H. Wasserman, M. Adams, S. Ethier, E. Schnetter, "Large-Scale Numerical Simulations on High-End Computational Platforms", Chapman & Hall/CRC Computational Science, edited by D. H. Bailey, R. F. Lucas, S. W. Williams, (CRC Press: 2010) Pages: 123

2009

"Accelerating Time-to-Solution for Computational Science and Engineering", J. Demmel, J. Dongarra, A. Fox, S. Williams, V. Volkov, K. Yelick, SciDAC Review, Number 15, December 2009,

George Michelogiannakis, William J. Dally, "Router Designs for Elastic Buffer On-Chip Networks", Conference on High Performance Computing Networking, Storage and Analysis, ACM, 2009,

This paper explores the design space of elastic buffer (EB) routers by evaluating three representative designs. We propose an enhanced two-stage EB router which maximizes throughput by achieving a 42% reduction in cycle time and 20% reduction in occupied area by using look-ahead routing and replacing the three-slot output EBs in the baseline router of [17] with two-slot EBs. We also propose a single-stage router which merges the two pipeline stages to avoid pipelining overhead. This design reduces zero-load latency by 24% compared to the enhanced two-stage router if both are operated at the same clock frequency; moreover, the single-stage router reduces the required energy per transferred bit and occupied area by 29% and 30% respectively, compared to the enhanced two-stage router. However, the cycle time of the enhanced two-stage router is 26% smaller than that of the single-stage router.

Kamesh Madduri, Samuel Williams, Stephane Ethier, Leonid Oliker, John Shalf, Erich Strohmaier, Katherine Yelick, "Memory-Efficient Optimization of Gyrokinetic Particle-to-Grid Interpolation for Multicore Processors", Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC), November 2009, doi: 10.1145/1654059.1654108

J. Shalf, M. Wehner, L. Oliker, "The Challenge of Energy-Efficient HPC", SCIDAC Review, Fall, 2009,

M. Wehner, L. Oliker, and J. Shalf, "Low Power Supercomputers", IEEE Spectrum, October 2009,

High-performance computing for such things as climate modeling is not going to advance at anything like the pace it has during the last two decades unless we apply fundamentally new ideas. Here we describe one possible approach. Rather than constructing supercomputers from the kinds of microprocessors found in fast desktop computers or servers, we propose adopting designs and design principles drawn, oddly enough, from the portable-electronics marketplace.

S. Zhou, D. Duffy, T. Clune, M. Suarez, S. Williams, M. Halem, "The Impact of IBM Cell Technology on the Programming Paradigm in the Context of Computer Systems for Climate and Weather Models", Concurrency and Computation:Practice and Experience (CCPE), August 2009, doi: 10.1002/cpe.1482

Zhengji Zhao, Juan Meza, Byounghak Lee, Hongzhang Shan, Eric Strohmaier, David H. Bailey, Lin-Wang Wang, "The linearly scaling 3D fragment method for large scale electronic structure calculations", Journal of Physics: Conference Series, July 1, 2009,

Paul Hargrove, Brock Palen, Jeff Squyres, RCE 12: BLCR, RCE Podcast (interview), June 19, 2009,

Brock Palen and Jeff Squyres speak with Paul Hargrove of the Berkeley Lab Checkpoint/Restart (BLCR) project.

R. Liu, K. Klues, S. Bird, S. Hofmeyr, K. Asanovic, J. D. Kubiatowicz, "Tessellation: Space-Time Partitioning in a Manycore Client OS", First USENIX Workshop on Hot Topics in Parallelism, June 15, 2009,

Shoaib Kamil, Cy Chan, Samuel Williams, Leonid Oliker, John Shalf, Mark Howison, E. Wes Bethel, Prabhat, "A Generalized Framework for Auto-tuning Stencil Computations", BEST PAPER AWARD - Cray User Group Conference (CUG), Atlanta, GA, May 4, 2009, LBNL 2078E,

Best Paper Award

Paul Hargrove, Jason Duell, Eric Roman, Berkeley Lab Checkpoint/Restart (BLCR): Status and Future Plans, Dagstuhl Seminar: Fault Tolerance in High-Performance Computing and Grids, May 2009,

S. Williams, et al., A Generalized Framework for Auto-tuning Stencil Computations, Cray User Group (CUG), 2009,

S. Williams, et al., Resource-Efficient, Hierarchical Auto-Tuning of a Hybrid Lattice Boltzmann Computation on the Cray XT4, Cray User Group (CUG), 2009,

S. Williams, J. Carter, L. Oliker, J. Shalf, K. Yelick, "Resource-Efficient, Hierarchical Auto-Tuning of a Hybrid Lattice Boltzmann Computation on the Cray XT4", Proceedings of the Cray User Group (CUG), Atlanta, GA, 2009,

S. Williams, A. Waterman, D. Patterson, "Roofline: an insightful visual performance model for multicore architectures", Communications of the ACM (CACM), April 2009, doi: 10.1145/1498765.1498785

Paul Hargrove, Jason Duell, Eric Roman, System-level Checkpoint/Restart with BLCR, TeraGrid 2009 Fault Tolerance Workshop, March 19, 2009,

George Michelogiannakis, Elastic Buffer Flow Control for On-Chip Networks, International Symposium on High Performance Computer Architecture, 2009,

George Michelogiannakis, James Balfour, William J. Dally, "Elastic Buffer Flow Control for On-Chip Networks", International Symposium on High Performance Computer Architecture, IEEE Computer Society, 2009,

This paper presents elastic buffers (EBs), an efficient flow-control scheme that uses the storage already present in pipelined channels in place of explicit input virtual-channel buffers (VCBs). With this approach, the channels themselves act as distributed FIFO buffers. Without VCBs, and hence virtual channels (VCs), deadlock prevention is achieved by duplicating physical channels. We develop a channel occupancy detector to apply universal globally adaptive load-balancing (UGAL) routing to load balance traffic in networks using EBs. Using EBs results in up to 8% (12% for low-swing channels) improvement in peak throughput per unit power compared to a VC flow-control network. These gains allow for a wider network datapath to be used to offset the removal of VCBs and increase throughput for a fixed power budget. EB networks have identical zero-load latency to VC networks operating under the same frequency. The microarchitecture of an EB router is considerably simpler than a VC router because allocators and credits are not required. For 5×5 mesh routers, this results in an 18% improvement in the cycle time.
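
For readers who want the flavor of the mechanism, here is a small cycle-level model in C of a single two-slot elastic buffer stage (the interface names and the ordering of pop before push are choices made for this sketch, not taken from the paper):

    /* Editorial model of a two-slot elastic buffer (EB) stage. */
    #include <stdbool.h>
    #include <stdint.h>

    typedef uint64_t flit_t;

    typedef struct {
        flit_t slot[2];
        int    count;               /* 0, 1, or 2 flits currently stored */
    } eb_stage;

    bool eb_ready(const eb_stage *s) { return s->count < 2; }  /* upstream may push  */
    bool eb_valid(const eb_stage *s) { return s->count > 0; }  /* downstream may pop */

    /* One clock edge: hand the oldest flit downstream if requested, then accept a
     * new flit from upstream if there is room.  In this model a pop frees a slot
     * for a push in the same cycle, so a full stage still moves one flit per
     * cycle as long as the downstream keeps draining. */
    void eb_cycle(eb_stage *s, bool push, flit_t in, bool pop, flit_t *out)
    {
        if (pop && eb_valid(s)) {
            *out = s->slot[0];
            s->slot[0] = s->slot[1];
            s->count--;
        }
        if (push && eb_ready(s)) {
            s->slot[s->count] = in;
            s->count++;
        }
    }

Chaining such stages along a channel gives the distributed FIFO behavior the paper describes, with no separate input buffer required at the downstream router.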

John Shalf, Challenges of Energy Efficient Scientific Computing, 2009,

R. Biswas, J. Vetter, L. Oliker, "Revolutionary Technologies for Acceleration of Emerging Petascale Applications", Guest Editors, Parallel Computing Journal, 2009,

John Shalf, Harvey Wasserman, Breakthrough Computing in Petascale Applications and Petascale System Examples at NERSC, 2009,

John Shalf, Satoshi Matsuoka, IESP Power Efficiency Research Priorities, 2009,

Samuel Williams, Carter, Oliker, Shalf, Katherine A. Yelick, "Optimization of a lattice Boltzmann computation on state-of-the-art multicore platforms", Journal of Parallel Distributed Computing (JPDC), 2009, 69:762-777, doi: 10.1016/j.jpdc.2009.04.002

Brian van Straalen, Shalf, J. Ligocki, Keen, Woo-Sun Yang, "Scalability challenges for massively parallel AMR applications", IPDPS, 2009, 1-12,

G. Hendry, S.A. Kamil, A. Biberman, J. Chan, B.G. Lee, M. Mohiyuddin, A. Jain, K. Bergman, L.P. Carloni, J. Kubiatowicz, L. Oliker, J. Shalf, "Analysis of Photonic Networks for Chip Multiprocessor Using Scientific Applications", International Symposium on Networks-on-Chip (NOCS), 2009,

Didem Unat, Theodore Hromadka III, Scott B. Baden, "An Adaptive Sub-sampling Method for In-memory Compression of Scientific Data", DCC, IEEE Computer Society, 2009,

Joseph Gebis, Oliker, Shalf, Williams, Katherine A. Yelick, "Improving Memory Subsystem Performance Using ViVA: Virtual Vector Architecture", International Conference on Architecture of Computing Systems (ARCS), Delft, Netherlands, 2009, 146-158,

David Donofrio, Oliker, Shalf, F. Wehner, Rowen, Krueger, Kamil, Marghoob Mohiyuddin, "Energy-Efficient Computing for Extreme-Scale Science", IEEE Computer, January 2009, 42:62-71, doi: 10.1109/MC.2009.35

Kamesh Madduri, Williams, Ethier, Oliker, Shalf, Strohmaier, Katherine A. Yelick, Memory-efficient optimization of Gyrokinetic particle-to-grid interpolation for multicore processors, Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC), 2009,

Marghoob Mohiyuddin, Murphy, Oliker, Shalf, Wawrzynek, Samuel Williams, "A design methodology for domain-optimized power-efficient supercomputing", Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC), 2009, doi: 10.1145/1654059.1654072

John Shalf and Jason Hick, "Storage Technology Fundamentals", Scientific Data Management: Challenges, Technology, and Deployment, edited by Arie Shoshani and Doron Rotem, Chapman & Hall/CRC, 2009,

K. Datta, S. Williams, V. Volkov, J. Carter, L. Oliker, J. Shalf, K. Yelick, "Auto-Tuning the 27-point Stencil for Multicore", Proceedings of Fourth International Workshop on Automatic Performance Tuning (iWAPT2009), January 2009,

S. Amarasinghe, D. Campbell, W. Carlson, A. Chien, W. Dally, E. Elnohazy, M. Hall, R. Harrison, W. Harrod, K. Hill, J. Hiller, S. Karp, C. Koelbel, D. Koester, P. Kogge, J. Levesque, D. Reed, V. Sarkar, R. Schreiber, M. Richards, A. Scarpelli, J. Shalf , A. Snavely, T. Sterling, "ExaScale Software Study: Software Challenges in Extreme Scale Systems", 2009,

John Shalf, Thomas Sterling, "Operating Systems For Exascale Computing", 2009,

Bronis R de Supinski, Sadaf Alam, David H Bailey, Laura, Chris Daley, Anshu Dubey, Todd Gamblin, Dan, Paul D Hovland, Heike Jagode, Karen Karavanic, Marin, John Mellor-Crummey, Shirley Moore, Boyana, Leonid Oliker, Catherine Olschanowsky, Philip C, Martin Schulz, Sameer Shende, Allan Snavely, Wyatt, Mustafa Tikir, Jeff Vetter, Pat Worley, Nicholas Wright, "Modeling the Office of Science ten year facilities plan: The PERI Architecture Tiger Team", Journal of Physics: Conference Series, 2009, 180:012039, doi: 10.1088/1742-6596/180/1/012039

J. Borrill, L. Oliker, J. Shalf, H. Shan, A. Uselton, "HPC global file system performance analysis using a scientific-application derived benchmark", Parallel Computing, 2009, 35:358-373, doi: 10.1016/j.parco.2009.02.002

Khaled Z. Ibrahim, J. Jaeger, Z. Liu, L.N. Pouchet, P. Lesnicki, L. Djoudi, D.Barthou, F. Bodin, C. Eisenbeis, G. Grosdidier, O. Pene, P. Roudeau, "Simulation of the Lattice QCD and Technological Trends in Computation", The 14th International Workshop on Compilers for Parallel Computers (CPC 09), 2009,

John Shalf, Erik Schnetter, Gabrielle Allen, Edward Seidel, Cactus and the Role of Frameworks in Complex Multiphysics HPC Applications, 2009,

John Shalf, Auto-Tuning: The Big Questions (Panel), 2009,

S. Kamil, L. Oliker, A. Pinar, J. Shalf, "Communication Requirements and Interconnect Optimization for High-End Scientific Applications", IEEE Transactions on Parallel and Distributed Systems (TPDS), 2009,

John Shalf, David Donofrio, Green Flash: Extreme Scale Computing on a Petascale Budget, 2009,

Kaushik Datta, Kamil, Williams, Oliker, Shalf, Katherine A. Yelick, "Optimization and Performance Modeling of Stencil Computations on Modern Microprocessors", SIAM Review, 2009, 51:129-159, doi: 10.1137/070693199

2008

S. Williams, Auto-tuning Performance on Multicore Computers, Ph.D. Thesis Dissertation Talk, University of California at Berkeley, 2008,

K. Datta, M. Murphy, V. Volkov, S. Williams, J. Carter, L. Oliker, D. Patterson, J. Shalf, K. Yelick, "Stencil Computation Optimization and Auto-Tuning on State-of-the-Art Multicore Architectures", Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC), November 2008, doi: 10.1109/SC.2008.5222004

Lin-Wang Wang, Byounghak Lee, Hongzhang Shan, Zhengji Zhao, Juan Meza, Erich Strohmaier, David H. Bailey, "Linearly scaling 3D fragment method for large-scale electronic structure calculations", Proceedings of SC08, November 2008,

Samuel Webb Williams, Andrew Waterman, David A. Patterson, "Roofline: An Insightful Visual Performance Model for Floating-Point Programs and Multicore Architectures", EECS Tech Report UCB/EECS-2008-134, October 2008,

Paul Hargrove, Jason Duell, Eric Roman, System-level Checkpoint/Restart with BLCR, Los Alamos Computer Science Symposium (LACSS08), October 15, 2008,

S. Williams, et al, "Auto-tuning and the Roofline model", View From the Top: Craig Mundie (Ph.D student poster session), 2008,

Cy Chan, Shoaib Kamil, John Shalf, Generalized Multicore Autotuning for Stencil-based PDE Solvers, Lawrence Berkeley National Laboratory, August 21, 2008,

S. Forrest, S. Hofmeyr, A. Somayaji, "The evolution of system-call monitoring", Annual Computer Security Applications Conference (ACSAC), August 12, 2008,

S. Williams, et al., The Roofline Model: A Pedagogical Tool for Auto-tuning Kernels on Multicore Architectures, Hot Chips 20, August 10, 2008,

S. Williams, et al., A Vision for Integrating Performance Counters into the Roofline model, UPCRC PMU Workshop (Performance Counters), 2008,

S. Williams, K. Datta, J. Carter, L. Oliker, J. Shalf, K. Yelick, D. Bailey, "PERI: Auto-tuning Memory Intensive Kernels for Multicore", SciDAC PI Meeting, Journal of Physics: Conference Series, 125 012038, July 2008, doi: 10.1088/1742-6596/125/1/012038

S. Williams, et al., PERI: Auto-tuning Memory Intensive Kernels for Multicore, SciDAC PI Meeting, 2008,

D. Bailey, J. Chame, C. Chen, J. Dongarra, M. Hall, J. Hollingsworth, P. Hovland, S. Moore, K. Seymour, J. Shin, A. Tiwari, S. Williams, H. You, "PERI Auto-tuning", SciDAC PI Meeting, Journal of Physics: Conference Series, 125 012001, 2008,

S. Williams, J. Carter, J. Demmel, L. Oliker, D. Patterson, J. Shalf, K. Yelick, R. Vuduc, "Autotuning Scientific Kernels on Multicore Systems", ASCR PI Meeting, 2008,

Paul Hargrove, Jason Duell, Eric Roman, Advanced Checkpoint Fault Tolerance Solutions for HPC, Workshop on Trends, Technologies and Collaborative Opportunities in High Performance and Grid Computing, Bangkok and Phuket Thailand, June 9, 2008,

S. Williams, et al., The Roofline Model: A Pedagogical Tool for Program Analysis and Optimization, ParLab Summer Retreat, 2008,

S. Zhou, D. Duffy, T. Clune, M. Suarez, S. Williams, M. Halem, "Impacts of the IBM Cell Processor on Supporting Climate Models", International Supercomputing Conference (ISC), 2008,

K. Datta, S. Williams, V. Volkov, M. Murphy, "Autotuning Structured Grid Kernels", ParLab Summer Retreat, 2008,

S. Williams, et. al, "The Roofline Model: A Pedagogical Tool for Program Analysis and Optimization", Parlab Summer Retreat, 2008,

S. Williams, J. Carter, L. Oliker, J. Shalf, K. Yelick, "Lattice Boltzmann Simulation Optimization on Leading Multicore Platforms", IEEE International Symposium on Parallel and Distributed Processing. BEST PAPER AWARD - Applications Track, 2008, doi: 10.1109/IPDPS.2008.4536295

Paul H. Hargrove, Dan Bonachea, Christian Bell, Experiences Implementing Partitioned Global Address Space (PGAS) Languages on InfiniBand, OpenFabrics Alliance 2008 International Sonoma Workshop, April 2008,

M. Wehner, L. Oliker, J. Shalf, "Performance Characterization of the World's Most Powerful Supercomputers", Internation Journal of High Performance Computing Applications (IJHPCA), April 2008,

Paul Hargrove, Jason Duell and Eric Roman, An Overview of Berkeley Lab Checkpoint/Restart (BLCR) for Linux Clusters, Presentation to ParLab group at UC Berkeley, March 18, 2008,

Gabrielle Allen (LSU/CCT), Gene Allen (MSC Inc.), Kenneth Alvin (SNL), Matt Drahzal (IBM), David Fisher (DoD-Mod), Robert Graybill (USC/ISI), Bob Lucas (USC/ISI), Tim Mattson (Intel), Hal Morgan (SNL), Erik Schnetter (LSU/CCT), Brian Schott (USC/ISI), Edward Seidel (LSU/CCT), John Shalf (LBNL/NERSC), Shawn Shamsian (MSC Inc.), David Skinner (LBNL/NERSC), Siu Tong (Engeneous), "Frameworks for Multiphysics Simulation: HPC Application Software Consortium Summit Concept Paper", 2008,

Shoaib Kamil, Shalf, Erich Strohmaier, "Power efficiency in high performance computing", IPDPS, 2008, 1-8,

Hongzhang Shan, Antypas, John Shalf, "Characterizing and predicting the I/O performance of HPC applications using a parameterized synthetic benchmark", SC, 2008, 42,

Michael F. Wehner, L. Oliker, John Shalf, "Towards Ultra-High Resolution Models of Climate and Weather", International Journal of High Performance Computing Applications (IJHPCA), January 2008, 22:149-165,

Shantenu Jha, Hartmut Kaiser, Andre Merzky, John Shalf, "SAGA - The Simple API for Grid Applications - Motivation, Design, and Implementation", Encyclopedia of Grid Technologies and Applications, Volume 1. Information Science Reference (www.info-sci-ref.com), 2008,

S. Ethier, W. M. Tang, R. Walkup, L. Oliker, "Large-Scale Gyrokinetic Particle Simulation of Microturbulence in Magnetically Confined Fusion Plasmas", IBM Journal of Research and Development, 2008,

Antypas, K., Shalf, J., and Wasserman, H., "NERSC-6 Workload Analysis and Benchmark Selection Process", 2008, LBNL 1014E,

Samuel Williams, Oliker, W. Vuduc, Shalf, A. Yelick, James Demmel, "Optimization of sparse matrix-vector multiplication on emerging multicore platforms", Parallel Computing, 2008, 35:38, doi: 10.1016/j.parco.2008.12.006

Harvey Wasserman, NERSC Workload Analysis and Benchmark Approach, 2008,

Harvey Wasserman, Breakthrough Science at NERSC, 2008,

S. Williams, J. Carter, L. Oliker, J. Shalf, K. Yelick, Lattice Boltzmann simulation optimization on leading multicore platforms, IEEE International Symposium on Parallel & Distributed Processing (IPDPS)., Pages: 1-14 2008,

John Shalf, NERSC User IO Cases, 2008,

Harvey Wasserman, Future Scientific Computing Challenges at NERSC, 2008,

Antypas, K., Shalf, J., and Wasserman, H., Recent Workload Characterization Activities at NERSC, 2008,

John Shalf, Neuroinformatics Congress: Future Hardware Challenges for Scientific Computing, 2008,

M. Wehner, L. Oliker, J. Shalf, Ultra-Efficient Exascale Scientific Computing, 2008,

S. Williams, et al., Autotuning Sparse and Structured Grid Kernels, Parlab Winter Retreat, 2008,

K. Datta, S. Williams, S. Kamil, "Autotuning Structured Grid Kernels", Parlab Winter Retreat, 2008,

S. Williams, K. Datta, J. Carter, L. Oliker, J. Shalf, K. Yelick, D. Bailey, PERI -- Auto-tuning Memory-intensive Kernels for Multicore, Journal of Physics: Conference Series, Pages: 012038 2008,

2007

S. Williams, et al., Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms, DOE/DOD Workshop on Emerging High-Performance Architectures and Applications, 2007,

S. Williams, et al., Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms, Supercomputing (SC), 2007,

S. Williams, L. Oliker, R. Vuduc, J. Shalf, K. Yelick, J. Demmel, "Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms", Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC), November 2007, doi: 10.1145/1362622.1362674

S. Williams, et al., Tuning Sparse Matrix Vector Multiplication for multi-core processors, Center for Scalable Application Development Software (CScADS), 2007,

Vassilis Papaefstathiou, Dionisios Pnevmatikatos, Manolis Marazakis, Giorgos Kalokairinos, Aggelos Ioannou, Michael Papamichael, Stamatis Kavadias, George Michelogiannakis, Manolis Katevenis, "Prototyping Efficient Interprocessor Communication Mechanisms", International Conference on Embedded Computer Systems: Architectures, Modelling and Simulations, IEEE Computer Society, 2007,

Parallel computing systems are becoming widespread and grow in sophistication. Besides simulation, rapid system prototyping becomes important in designing and evaluating their architecture. We present an efficient FPGA-based platform that we developed and use for research and experimentation on high speed interprocessor communication, network interfaces and interconnects. Our platform supports advanced communication capabilities such as Remote DMA, Remote Queues, zero-copy data delivery and flexible notification mechanisms, as well as link bundling for increased performance. We report on the platform architecture, its design cost, complexity and performance (latency and throughput). We also report our experiences from implementing benchmarking kernels and a user-level benchmark application, and show how software can take advantage of the provided features, but also expose the weaknesses of the system.

S. Williams, et al., Tuning Sparse Matrix Vector Multiplication for multi-core SMPs, Parlab Seminar, 2007,

Costin Iancu, Erich Strohmaier, "Optimizing communication overlap for high-speed networks", Principles and Practice of Parallel Programming (PPoPP), 2007,

Approaching Ideal NoC Latency with Pre-Configured Routes, George Michelogiannakis, University of Crete, 2007,

George Michelogiannakis, Dionisios Pnevmatikatos, Manolis Katevenis, "Approaching Ideal NoC Latency with Pre-Configured Routes", First International Symposium on Networks-on-Chip, IEEE Computer Society, 2007,

In multi-core ASICs, processors and other compute engines need to communicate with memory blocks and other cores with latency as close as possible to the ideal of a direct buffered wire. However, current state of the art networks-on-chip (NoCs) suffer, at best, latency of one clock cycle per hop. We investigate the design of a NoC that offers close to the ideal latency in some preferred, run-time configurable paths. Processors and other compute engines may perform network reconfiguration to guarantee low latency over different sets of paths as needed. Flits in non-preferred paths are given lower priority than flits in preferred ones, and suffer a delay of one clock cycle per hop when there is no contention. To achieve our goal, we use the "mad postman" [5] technique: every incoming flit is eagerly (i.e. speculatively) forwarded to the input's preferred output, if any. This is accomplished with the mere delay of a single pre-enabled tri-state driver. We later check if that decision was correct, and if not, we forward the flit to the proper output. Incorrectly forwarded flits are classified as dead and eliminated in later hops. We use a 2D mesh topology tailored for processor-memory communication, and a modified version of XY routing that remains deadlock-free. Our evaluation shows that, for the preferred paths, our approach offers typical latency around 500 ps versus 1500 ps for a full clock cycle or 135 ps for an ideal direct connect, in a 130 nm technology; non-preferred paths suffer a one clock cycle delay per hop, similar to that of other approaches. Performance gains are significant and can be proven greatly useful in other application domains as well.
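
A simplified software rendering of the eager forwarding decision at one router input may help; the structures, the stub routing function, and the dead-flit handling are assumptions made for this illustration only:

    /* Editorial sketch of "mad postman"-style eager forwarding at one input. */
    #include <stdbool.h>
    #include <stdint.h>

    typedef struct {
        uint32_t dest;      /* destination encoded in the header                   */
        bool     dead;      /* set when an upstream hop eagerly misrouted the flit */
    } flit;

    typedef struct {
        int preferred_out;  /* pre-configured low-latency output, or -1 if none */
    } input_port;

    /* Stubs standing in for the real datapath. */
    int  route_xy(uint32_t dest)       { return (int)(dest % 5); }
    void forward(int out_port, flit f) { (void)out_port; (void)f; }

    /* Eagerly drive the incoming flit onto the preferred output before routing
     * runs; then check the routing decision and correct it if the guess was wrong. */
    void on_flit_arrival(const input_port *p, flit f)
    {
        if (f.dead) return;                   /* eliminate flits misrouted upstream  */
        bool eager = (p->preferred_out >= 0);
        if (eager)
            forward(p->preferred_out, f);     /* speculative copy, near-zero latency */
        int correct = route_xy(f.dest);
        if (eager && correct == p->preferred_out)
            return;                           /* guess was right: nothing more to do */
        /* Wrong guess: the speculative copy is later classified as dead and dropped
         * at a subsequent hop (detection logic omitted); resend on the right output. */
        forward(correct, f);
    }
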

Paul Hargrove, Eric Roman, Jason Duell, Job Preemption with BLCR, Urgent Computing Workshop, April 25, 2007,

Leonid Oliker, Julian Borrill, Hongzhang Shan, John Shalf, Investigation Of Leading HPC I/O Performance Using A Scientific-Application Derived Benchmark., 2007,

John Shalf, Shoaib Kamil, David Skinner, Leonid Oliker, Interconnect Requirements for HPC Applications, 2007,

L. Oliker, R. Biswas, R. Van der Wijngaart, D. Bailey, A. Snavely, "Performance Evaluation and Modeling of Ultra-Scale Systems", Parallel Processing for Scientific Computing, edited by Michael A. Heroux, Padma Raghavan, and Horst D. Simon, (SIAM: 2007) doi: 10.1137/1.9780898718133.ch5

John Shalf, Shoaib Kamil, David Bailey, Erich Strohmaier, Power Efficiency and the Top500, 2007,

Hongzhang Shan and John Shalf, "Using IOR to Analyze the I/O performance for HPC Platforms", CUG.org, 2007, LBNL 62647,

John Shalf, Power, Cooling, and Energy Consumption for the Petascale and Beyond., 2007,

Shoaib Kamil, Pinar, Gunter, Lijewski, Oliker, John Shalf, "Reconfigurable hybrid interconnection for static and dynamic scientific applications", Conf. Computing Frontiers, 2007, 183-194, LBNL 60060,

John Shalf, Petascale Computing Application Challenges., 2007,

J. Shalf, L. Oliker, M. Lijewski, S. Kamil, J. Carter, A. Canning, S. Ethier, "Performance Characteristics of Potential Petascale Scientific Applications", Chapman & Hall/CRC Computational Science, (CRC Press: 2007) Pages: 1

Book Chapter

A. Napov and Y. Notay, "Algebraic Analysis of V-Cycle Multigrid", Report GANMN 07-01, January 1, 2007,

John Shalf, "The New Landscape of Parallel Computer Architecture", Journal of Physics: Conference Series, Volume . IOP Electronics Journals, 2007,

Shoaib Kamil, John Shalf, Power Efficiency Metrics for the Top500, 2007,

Shoaib Kamil, John Shalf, "Measuring Power Efficiency of NERSC's Newest Flagship Machine", 2007,

Harvey Wasserman, NERSC XT3/XT4 Benchmarking, 2007,

John Shalf, Landscape of Computing Architecture: Introduction to the "Berkeley View", 2007,

J. Borrill, L. Oliker, J. Shalf, H. Shan, "Investigation Of Leading HPC I/O Performance Using A Scientific-Application Derived Benchmark", Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC), 2007,

Leonid Oliker, Canning, Carter, Iancu, Lijewski, Kamil, Shalf, Shan, Strohmaier, Ethier, Tom Goodale, "Scientific Application Performance on Candidate PetaScale Platforms", IEEE International Symposium on Parallel & Distributed Processing (IPDPS). BEST PAPER AWARD - application track., 2007, 1-12, doi: 10.1109/IPDPS.2007.370259

John Shalf, About Memory Bandwidth and Multicore, 2007,

L. Oliker, A. Canning, J. Carter, C. Iancu, M. Lijewski, S. Kamil, J. Shalf, H. Shan, E. Strohmaier, S. Ethier, T. Goodale, "Performance Characteristics of Potential Petascale Scientific Applications", Petascale Computing: Algorithms and Applications. Chapman & Hall/CRC Computational Science Series (Hardcover), edited by David A. Bader, ( 2007)

Chapter

J. Carter, Y. He, J. Shalf, H. Shan, E. Strohmaier, H. Wasserman, "The Performance Effect of Multi-core on Scientific Applications", Proceedings of Cray User Group, 2007, LBNL 62662,

John Shalf, The Landscape of Parallel Computing Architecture., 2007,

J. Carter, L. Oliker, J. Shalf, "Performance Evaluation of Scientific Applications on Modern Parallel Vector Systems", Extended Version: Lecture Notes in Computer Science, 2007,

L. Oliker, J. Shalf, M. Wehner, Climate Modeling at the Petaflop Scale using Semi-Custom Computing, SIAM Conference on Computational Science and Engineering, 2007,

J. Levesque, J. Larkin, M. Foster, J. Glenski, G. Geissler, S. Whalen, B. Waldecker, J. Carter, D. Skinner, Y. He, H. Wasserman, J. Shalf, H. Shan, E. Strohmaier, "Understanding and Mitigating Multicore Performance Issues on the AMD Opteron Architecture", 2007, LBNL 62500,

John Shalf, Overturning the Conventional Wisdom for the Multicore Era: Everything You Know is Wrong, 2007,

John Shalf, Hongzhang Shan, User Perspective on HPC I/O Requirements., 2007,

John Shalf, NERSC Workload Analysis, 2007,

Samuel Williams, Shalf, Oliker, Kamil, Husbands, Katherine A. Yelick, "Scientific Computing Kernels on the Cell Processor", International Journal of Parallel Programming, January 2007, 35:263-298, doi: 10.1007/s10766-007-0034-5

John Shalf, NERSC Power Efficiency Analysis., 2007,

John Shalf, Memory Subsystem Performance and QuadCore Predictions, 2007,

2006

K. Asanovic, R. Bodik, B. Catanzaro, J. Gebis, P. Husbands, K. Keutzer, D. Patterson, W. Plishker, J. Shalf, S. Williams, K. Yelick, "The Landscape of Parallel Computing Research: A View from Berkeley", EECS Technical Report, December 2006,

Dan Bonachea, Rajesh Nishtala, Paul Hargrove, Mike Welcome, Kathy Yelick, "Optimized Collectives for PGAS Languages with One-Sided Communication", Poster Session at SuperComputing 2006, November 2006,

Hongzhang Shan, Erich Strohmaier, Ji Qiang, David H. Bailey, Kathy Yelick, "Performance modeling and optimization of a high energy colliding beam simulation code", Proceedings of SC2006, November 2006,

Dan Bonachea, Rajesh Nishtala, Paul Hargrove, Katherine Yelick, Efficient Point-to-Point Synchronization in UPC, 2nd Conf. on Partitioned Global Address Space Programming Models (PGAS06), October 4, 2006,

S. Williams, et al., Structured Grids and Sparse Matrix Vector Multiplication on the Cell Processor, Global Signal Processing Expo (GSPx), 2006,

S. Williams, et al., 3D Lattice Boltzmann Magneto-hydrodynamics (LBMHD3D), UTK Summit on Software and Algorithms for the Cell Processor, 2006,

S. Williams, J. Shalf, L. Oliker, P. Husbands, S. Kamil, K. Yelick, "The Potential of the Cell Processor for Scientific Computing", ACM International Conference on Computing Frontiers, 2006, doi: 10.1145/1128022.1128027

S. Williams, et al, The Potential of the Cell Processor for Scientific Computing, presented at Transmeta, 2006,

S. Williams, et al., The Potential of the Cell Processor for Scientific Computing, LBL Scientific Computing Seminar, 2006,

L. Oliker, J. Carter, M. Wehner, A. Canning, S. Ethier, A. Mirin, G. Bala, D. Parks, P. Worley, S. Kitawaki, Y. Tsuda, "Scientific Application Performance on Leading Scalar and Vector Supercomputing Platforms", International Journal of High Performance Computing Applications (IJHPCA), 2006,

J. Carter, L. Oliker, J. Shalf, "Performance Evaluation of Scientific Applications on Modern Parallel Vector Systems", High Performance Computing for Computational Science - HIGHEST RANKED CONFERENCE PAPER, 2006,

Samuel Williams, Shalf, Oliker, Kamil, Husbands, Katherine A. Yelick, The potential of the cell processor for scientific computing, Conf. Computing Frontiers, Pages: 9-20, 2006,

J. Carter, L. Oliker, J. Shalf, "Performance Evaluation of Scientific Applications on Modern Parallel Vector Systems", VECPAR, 2006,

Tom Goodale, Shantenu Jha, Hartmut Kaiser, Thilo Kielmann, Pascal Kleijer, Gregor von Laszewski, Craig Lee, Andre Merzky, Hrabri Rajic, John Shalf, "SAGA: A Simple API for Grid Applications -- High-Level Application Programming on the Grid", Computational Methods in Science and Technology, Volume 12(1). Poznan, 2006, LBNL 59066,

John Shalf, David Bailey, Top500 Power Efficiency, 2006,

Hongzhang Shan, John Shalf, "Analysis of Parallel IO on Modern HPC Platforms", 2006,

Analysis of the parallel I/O requirements of a number of HPC applications, combined with microbenchmarks to aid in understanding their performance.

Jonathan Carter, Oliker, John Shalf, "Performance Evaluation of Scientific Applications on Modern Parallel Vector Systems", VECPAR, Springer Berlin/Heidelberg, 2006, 4395:490-503,

Shoaib Kamil, Datta, Williams, Oliker, Shalf, Katherine A. Yelick, "Implicit and explicit optimizations for stencil computations", Memory System Performance and Correctness, 2006, 51-60, doi: 10.1145/1178597.1178605

J. Carter, L. Oliker, "Performance Evaluation of Lattice-Boltzmann Magnetohyrodynamics Simulations on Modern Parallel Vector Systems", Proceedings of the 2nd Teraflop Workshop. Lecture Notes in Computer Science (LNCS), Stuttgard, Germany, January 1, 2006,

L. Oliker, J. Carter, Leading Computational Methods on the Earth Simulator, SIAM Conference on Parallel Processing for Scientific Computing, 2006,

2005

S. Williams, J. Shalf, L. Oliker, P. Husbands, K. Yelick, "Dense and Sparse Matrix Operations on the Cell Processor", LBNL Technical Report, 2005,

George Michelogiannakis, IPIF to PCI bridge specification, University of Crete, 2005,

S. Hofmeyr, New approaches to security: lessons from nature, Secure Convergence Journal, June 1, 2005,

Leonid Oliker, Jonathan Carter, Michael Wehner, Andrew Canning, Stephane Ethier, Art Mirin, David Parks, Patrick Worley, Shigemune Kitawaki, Yoshinori Tsuda, "Leading Computational Methods on Scalar and Vector HEC Platforms", SC 05, Washington, DC, USA, IEEE Computer Society, 2005, 62, doi: 10.1109/SC.2005.41

J. Carter, M. Soe, L. Oliker, Y. Tsuda, G. Vahala, L. Vahala, A. Macnab, "Magnetohydrodynamic Turbulence Simulations on the Earth Simulator Using the Lattice Boltzmann Method", International Conference for High Performance Computing, Networking, Storage and Analysis (SC) - Gordon Bell Finalist, Washington, DC, USA, IEEE Computer Society, 2005,

S. Kamil, J. Shalf, L. Oliker, D. Skinner, "Understanding Ultra-Scale Application Communication Requirements", IEEE International Symposium on Workload Characterization (IISWC), 2005,

Shoaib Kamil, Husbands, Oliker, Shalf, Katherine A. Yelick, "Impact of modern memory subsystems on cache optimizations for stencil computations", Memory System Performance, 2005, 36-43,

J. Borrill, J. Carter, L. Oliker, D. Skinner, R. Biswas, "Integrated performance monitoring of a cosmology application on leading HEC platforms", Proceedings of the International Conference on Parallel Processing, 2005, 2005:119-128, doi: 10.1109/ICPP.2005.47

Leonid Oliker, Canning, Carter, Shalf, Skinner, Ethier, Biswas, Jahed Djomehri, Rob F. Van der Wijngaart, "Performance evaluation of the SX-6 vector architecture for scientific computations", Concurrency - Practice and Experience, January 2005, 17:69-93, doi: 10.1002/cpe.884

H. Simon, W. Kramer, W. Saphir, J. Shalf, D. Bailey, L. Oliker, et al, "Science Driven System Architecture: A New Process for Leadership Class Computing", Journal of the Earth Simulator, 2005,

L. Oliker, A. Canning, J. Carter, J. Shalf, H. Simon, S. Ethier, D. Parks, S. Kitawaki, Y. Tsuda, T. Sato, "Performance of Ultra-Scale Applications on Leading Vector and Scalar HPC Platforms", Journal of the Earth Simulator, January 2005, 3,

John Shalf, Kamil, Oliker, David Skinner, "Analyzing Ultra-Scale Application Communication Requirements for a Reconfigurable Hybrid Interconnect", Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC), 2005, 17,

Horst Simon, William Kramer, William Saphir, John Shalf, David Bailey, Leonid Oliker, Michael Banda, C. William McCurdy, John Hules, Andrew Canning, Marc Day, Philip Colella, David Serafini, Michael Wehner, Peter Nugent, "Science-Driven System Architecture: A New Process for Leadership Class Computing", Journal of the Earth Simulator, Volume 2., 2005, LBNL 56545,

L. Oliker, R. Biswas, J. Borrill, A. Canning, J. Carter, M.J. Djomehri, H. Shan, D. Skinner, "A performance evaluation of the cray X1 for scientific applications", Lecture Notes in Computer Science, Springer Berlin/Heidelberg, 2005, 3402:51-65,

2004

E. Strohmaier, Hongzhang Shan, "Architecture Independent Performance Characterization and Benchmarking for Scientific Applications", International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, Volendam, The Netherlands, October 2004,

Hongzhang Shan, E. Strohmaier, "Performance Characteristics of the Cray X1 and Their Implications for Application Performance Tuning", International Conference on Supercomputing, Saint-Malo, France, June 2004,

J. Gebis, S. Williams, D. Patterson, C. Kozyrakis, "VIRAM1: A Media-Oriented Vector Processor with Embedded DRAM", 41st Design Automation Student Design Contest (DAC), 2004,

H. Shan, E. Strohmaier, L. Oliker, "Optimizing Performance of Superscalar Codes for a Single Cray X1 MSP", Proceedings of the 46th Cray User Group Conference (CUG), 2004,

P. A. Agarwal, R. A. Alexander, E. Apra, S. Balay, A. S. Bland, J. Colgan, E. F. D'Azevedo, J. J. Dongarra, T. H. Dunigan, Jr., M. R. Fahey, R. A. Fahey, A. Geist, M. Gordon, R. J. Harrison, D. Kaushik, M. Krishnakumar, P. Luszczek, A. Mezzacappa, J. A. Nichols, J. Nieplocha, L. Oliker, T. Packwood, M. S. Pindzola, T. C. Schulthess, J. S. Vetter, J. B. White, III, T. L. Windus, P. H. Worley, T. Zacharia, "Cray X1 Evaluation Status Report", Proceedings of the 46th Cray User Group Conference (CUG), 2004,

L. Oliker, J. Carter, Evaluation of Vector Architectures for Scientific Codes, SIAM Conference on Parallel Processing for Scientific Computing, 2004,

L. Oliker, M. Wehner, D. Parks, W.S. Wang, High Resolution Atmospheric General Circulation Model Simulations on Vector and Cache-based Architectures, SIAM Conference on Parallel Processing for Scientific Computing, 2004,

J. Duell, P. Hargrove, E. Roman, An Overview of Berkeley Lab's Linux Checkpoint/Restart, Presentation at LLNL, January 2004,

J. Carter, J. Borrill, L. Oliker, "Performance characteristics of a cosmology package on leading HPC architectures", Lecture Notes in Computer Science, Springer Berlin/Heidelberg, 2004, 3296:176-188,

H. Shan, L. Oliker, R. Biswas, W. Smith, "Scheduling in Heterogeneous Grid Environments: The Effects of Data Migration", International Conference on Advanced Computing and Communication: ADCOM, 2004,

Leonid Oliker, Canning, Carter, Shalf, Stephane Ethier, "Scientific Computations on Modern Parallel Vector Systems", International Conference for High Performance Computing, Networking, Storage and Analysis (SC). Nominated for Best Paper Award, Washington, DC, USA, IEEE Computer Society, 2004, 10, doi: 10.1109/SC.2004.54

L. Oliker, J. Borrill, A. Canning, J. Carter, H. Shan, D. Skinner, R. Biswas, J. Djomehri, "A Performance Evaluation of the Cray X1 for Scientific Applications", VECPAR'04: 6th International Meeting on High Performance Computing for Computational Science, 2004,

G. Griem, L. Oliker, J. Shalf, K. Yelick, "Identifying Performance Bottlenecks on Modern Microarchitectures Using an Adaptable Probe", IEEE International Symposium on Parallel & Distributed Processing (IPDPS), 2004,

2003

R. Biswas, L. Oliker, H. Shan, "Parallel Computing Strategies for Irregular Algorithms", Annual Review of Scalable Computing, April 2003,

Hongzhang Shan, Jaswinder P. Singh, Leonid Oliker, Rupak Biswas, "Message Passing and Shared Address Space Parallelism on an SMP Cluster", Parallel Computing Journal, Volume 29, Issue 2, February 2003,

L. Oliker, G. Griem, "Transitive Closure on the Imagine Stream Processor", Fifth Workshop on Media and Stream Processors (MSP5), 2003,

Leonid Oliker, Andrew Canning, Jonathan Carter, John Shalf, David Skinner, Stephane Ethier, Rupak Biswas, Jahed Djomehri, Rob F. Van der Wijngaart, "Evaluation of Cache-based Superscalar and Cacheless Vector Architectures for Scientific Computations", International Conference for High Performance Computing, Networking, Storage and Analysis (SC), New York, NY, USA, ACM, 2003, 38, doi: 10.1145/1048935.1050213

H. Shan, L. Oliker, R. Biswas, "Job Superscheduler Architecture and Performance in Computational Grid Environments", International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2003,

S. Chatterji, J. Duell, M. Narayanan, L. Oliker, "Performance Evaluation of Two Emerging Media Processors: VIRAM and Imagine", Workshop on Parallel and Distributed Image Processing, Video Processing, and Multimedia (PDIVM), 2003,

2002

J. Duell, P. Hargrove, E. Roman, "The Design and Implementation of Berkeley Lab's Linux Checkpoint/Restart", LBNL Technical Report, December 2002, LBNL 54941,

Performance Characterization and Benchmarking for High Performance Systems and Applications, TOP500 BOF; SC2002, November 20, 2002,

E. Strohmaier, Performance Characterization and Benchmarking for High Performance Systems and Applications, University of Tennessee, CS Seminar, November 8, 2002,

Erich Strohmaier, Performance Characterization and Benchmarking for High Performance Systems and Applications, CCS Seminar, October 9, 2002,

Erich Strohmaier, Benchmarking for High Performance Systems and Applications, DARPA HPCS Performance Workshop, September 19, 2002,

J. Duell, P. Hargrove, E. Roman, "Requirements for Linux Checkpoint/Restart", LBNL Technical Report, May 2002, LBNL 49659,

H. Shan, J. Singh, "A Comparison of Three Programming Models for Adaptive Applications on the Origin2000", Extended Version: Journal of Parallel and Distributed Computing, 2002,

B. Gaeke, P. Husbands, X. Li, L. Oliker, K. Yelick, and R. Biswas, "Memory-Intensive Benchmarks: IRAM vs. Cache-Based Machines", International Parallel & Distributed Processing Symposium (IPDPS), 2002,

L. Oliker, X. Li, P. Husbands, R. Biswas, "Effects of Ordering Strategies and Programming Paradigms on Sparse Matrix Computations", SIAM Review Journal, 2002,

2001

S. Hofmeyr, "An interpretive introduction to the immune system", Design Principles for the Immune System and Other Distributed Autonomous Systems, ( June 14, 2001)

S. Forrest, S. Hofmeyr, "Immunology as information processing", Design Principles for the Immune System and other Distributed Autonomous Systems, ( June 14, 2001)

C. Kozyrakis, D. Judd, J. Gebis, S. Williams, D. Patterson, K. Yelick, "Hardware/Compiler Co-development for an Embedded Media Processor", Proceedings of the IEEE, 2001, doi: 10.1109/5.964446

S. Forrest, S. Hofmeyr, "Engineering an immune system", Graft, June 1, 2001,

L. Oliker, X. Li, P. Husbands, R. Biswas, "Ordering Schemes for Sparse Matrices using Modern Programming Paradigms", The IASTED International Conference on Applied Informatics (AI), 2001,

H. Shan, J. Singh, L. Oliker, R. Biswas, "Message Passing vs. Shared Address Space on a Cluster of SMPs", International Parallel & Distributed Processing Symposium (IPDPS), 2001,

H. Shan, J. Singh, L. Oliker, R. Biswas, Design Strategies for Irregularly Adapting Parallel Applications, SIAM Conference on Parallel Processing, 2001,

L. Oliker, R. Biswas, P. Husbands, X. Li, Ordering Sparse Matrices for Cache-Based Systems, SIAM Conference on Parallel Processing, 2001,

2000

S. Hofmeyr, S. Forrest, "Architecture for an artificial immune system", Evolutionary Computation, December 1, 2000,

C. Kozyrakis, J. Gebis, D. Martin, S. Williams, I. Mavroidis, S. Pope, D. Jones, D. Patterson, K. Yelick, Vector IRAM: A media-oriented vector processor with embedded DRAM, Hot Chips 12, 2000,

H. Shan, J. Singh, L. Oliker, R. Biswas, "A Comparison of Three Programming Models for Adaptive Applications on the Origin2000", International Conference for High Performance Computing, Networking, Storage and Analysis (SC) - BEST STUDENT PAPER AWARD, 2000,

L. Oliker, A. Wong, W. Kramer, T. Kaltz, D. Bailey, "ESP: A System Utilization Benchmark", International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2000,

L. Oliker, R. Biswas, Multithreading for Dynamic Irregular Applications, First SIAM Conference on Computational Science and Engineering, 2000,

L. Oliker, X. Li, G. Heber, R. Biswas, "Parallel Conjugate Gradient: Effects of Ordering Strategies, Programming Paradigms, and Architectural Platforms", 13th International Conference on Parallel and Distributed Computing Systems, 2000,

L. Oliker, A. Wong, W. Kramer, T. Kaltz, D. Bailey, "System Utilization Benchmark on the Cray T3E and IBM SP", Fifth Workshop on Job Scheduling, 2000,

L. Oliker, X. Li, G. Heber, R. Biswas, "Ordering Unstructured Meshes for Sparse Matrix Computations on Leading Parallel Systems", Seventh International Workshop on Solving Irregularly Structured Problems in Parallel, 2000,

L. Oliker, R. Biswas, "Multithreaded Implementation of a Dynamic Irregular Application", 5th NASA Computational Aerosciences Workshop, 2000,

L. Oliker, R. Biswas, S. Das, D. Harvey, "Parallel Dynamic Load Balancing Strategies for Adaptive Irregular Applications", DRAMA special issue of Applied Mathematical Modeling Journal, 2000,

L. Oliker, R. Biswas, "Parallelization of a Dynamic Unstructured Algorithm using Three Leading Programming Paradigms", IEEE Transactions on Parallel and Distributed System (TPDS), 2000,

L. Oliker, R. Biswas and H. Gabow, "Parallel Tetrahedral Mesh Adaptation with Dynamic Load Balancing", Parallel Computing Journal, Special Issue on Graph Partitioning, pp. 1583-1608, 2000,

1999

S. Hofmeyr, S. Forrest, "Immunity by design: an artificial immune system", GECCO, June 13, 1999,

L. Oliker, R. Biswas, "Parallelization of a Dynamic Unstructured Application using Three Leading Paradigms", International Conference for High Performance Computing, Networking, Storage and Analysis (SC) - BEST PAPER AWARD, 1999,

R. Biswas, S.K. Das, D.J. Harvey, L. Oliker, "Portable Parallel Programming for the Dynamic Load Balancing of Unstructured Grid Applications", 13th International Parallel Processing Symposium, 1999,

R. Biswas, L. Oliker, "Experiments with Repartitioning and Load Balancing Adaptive Meshes", Grid Generation and Adaptive Algorithms, IMA Volumes in Mathematics and its Applications, Vol. 113, Springer-Verlag, pp.89-112, 1999,

1998

A. Somayaji, S. Hofmeyr, S. Forrest, "Principles of a computer immune system", New Security Paradigms Workshop (NSPW), September 22, 1998,

S. Hofmeyr, S. Forrest, P. D'haeseleer, "An immunological approach to distributed network intrusion detection", Recent Advances in Intrusion Detection (RAID), September 14, 1998,

S. Hofmeyr, A. Somayaji, S. Forrest, "Intrusion detection using sequences of system calls", Journal of Computer Security, June 1, 1998,

K. Schloegel, G. Karypis, V. Kumar, R. Biswas, L. Oliker, "A Performance Study of Diffusive vs. Remapped Load-Balancing Schemes", 11th International Conference on Parallel and Distributed Computer Systems, pp. 59-66, 1998,

L. Oliker, PLUM: Parallel Load Balancing for Adaptive Unstructured Meshes, 1998,

L. Oliker, R. Biswas, H.N. Gabow, "Performance Analysis and Portability of the PLUM Load Balancing System", Euro-Par'98 Parallel Processing, Lecture Notes in Computer Science, Vol. 1470, Springer-Verlag, pp. 307-317, 1998,

L. Oliker, R. Biswas, "PLUM: Parallel Load Balancing for Adaptive Unstructured Meshes (JPDC version)", Journal of Parallel and Distributed Computing (JPDC), 1998,

1997

A. P. Kosoresow, S. Hofmeyr, S. Forrest, "Intrusion detection via system call traces", IEEE Software, September 1, 1997,

L. Oliker, R. Biswas, "Dynamic Domain Decomposition for Large-Scale Adaptive Calculations", 10th International Conference on Domain Decomposition Methods, 1997,

L. Oliker, R. Biswas, "Load Balancing Unstructured Adaptive Grid Computations", 4th U.S. National Congress on Computaional Mechanics, 1997,

R. Biswas, L. Oliker, "Load Balancing Sequences of Unstructured Adaptive Grids", 4th International Conference on High Performance Computing (HiPC), 1997,

R. Strawn, L. Oliker, R. Biswas, "New Computational Methods for the Prediction and Analysis of Helicopter Noise", Journal of Aircraft, 34, pp. 665-672, 1997,

L. Oliker, R. Biswas, "Efficient Load Balancing and Data Remapping for Adaptive Grid Calculations", 9th ACM Symposium on Parallel Algorithms and Architectures (SPAA), 1997,

R. Biswas, L. Oliker, Load Balancing Unstructured Adaptive Grids for CFD Problems, 8th SIAM Conference on Parallel Processing for Scientific Computing, 1997,

S. Forrest, S. Hofmeyr, A. Somayaji, "Computer immunology", Communications of the ACM, January 1, 1997,

1996

S. Chatterjee, J. Gilbert, L. Oliker, R. Schreiber, and T. Sheffler, "Algorithms for Automatic Alignment of Arrays", Journal of Parallel and Distributed Computing (JPDC), July 1996,

R. Biswas, L. Oliker, A. Sohn, "Global Load Balancing with Parallel Mesh Adaption on Distributed-Memory Systems", International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 1996,

L. Oliker, R. Biswas, S. Strawn, "Parallel Implementation of an Adaptive Scheme for 3D Unstructured Grids on the SP2", Parallel Algorithms for Irregularly Structured Problems, Lecture notes in Computer Science, Vol. 1117, Springer-Verlag, pp. 35-47, 1996,

L. Oliker, R. Biswas, S. Strawn, "Parallel Mesh Adaption with Global Load Balancing on the SP2", NASA Computational Aerosciences Workshop, 1996,

A.M. Wissink, A.S. Lyrintzis, R.C. Strawn, L. Oliker, R. Biswas, "Efficient Helicopter Aerodynamic and Aeroacoustic Predictions on Parallel Computers", 34th AIAA Aerospace Sciences Meeting, Paper 96-0153, 1996,

1994

S. Forrest, S. Hofmeyr, A. Somayaji, T. A. Longstaff, "A sense of self for UNIX processes", IEEE Symposium on Computer Security and Privacy, May 6, 1994,