2019 Publications

Mark F Adams

2019

Mark Adams, Stephen Cornford, Daniel Martin, Peter McCorquodale, "Composite matrix construction for structured grid adaptive mesh refinement", Computer Physics Communications, November 2019, 244:35-39, doi: 10.1016/j.cpc.2019.07.006

Deb Agarwal

2019

P. Linton, W. Melodia, A. Lazar, D. Agarwal, L. Bianchi, D. Ghoshal, K. Wu, G. Pastorello, L. Ramakrishnan, "Identifying Time Series Similarity in Large-Scale Earth System Datasets", International Workshop on Big Data Tools, Methods, and Use Cases for Innovative Scientific Discovery (BTSD), in conjunction with the IEEE International Conference on Big Data (Big Data), 2019,

P. Linton, W. Melodia, A. Lazar, D. Agarwal, L. Bianchi, D. Ghoshal, K. Wu, G. Pastorello, L. Ramakrishnan, "Identifying Time Series Similarity in Large-Scale Earth System Datasets", The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC19), 2019,

C. Varadharajan, S. Cholia, C. Snavely, V. Hendrix, C. Procopiou, D. Swantek, W. J. Riley, and D. A. Agarwal, "Launching an accessible archive of environmental data", Eos, 100, January 8, 2019, doi: 10.1029/2019EO111263

Hadia Ahmed

2019

John Bachan, Scott B. Baden, Steven Hofmeyr, Mathias Jacquelin, Amir Kamil, Dan Bonachea, Paul H. Hargrove, Hadia Ahmed, "UPC++: A High-Performance Communication Framework for Asynchronous Computation", 33rd IEEE International Parallel & Distributed Processing Symposium (IPDPS'19), Rio de Janeiro, Brazil, IEEE, May 2019, doi: 10.25344/S4V88H

UPC++ is a C++ library that supports high-performance computation via an asynchronous communication framework. This paper describes a new incarnation that differs substantially from its predecessor, and we discuss the reasons for our design decisions. We present new design features, including future-based asynchrony management, distributed objects, and generalized Remote Procedure Call (RPC).
We show microbenchmark performance results demonstrating that one-sided Remote Memory Access (RMA) in UPC++ is competitive with MPI-3 RMA; on a Cray XC40 UPC++ delivers up to a 25% improvement in the latency of blocking RMA put, and up to a 33% bandwidth improvement in an RMA throughput test. We showcase the benefits of UPC++ with irregular applications through a pair of application motifs, a distributed hash table and a sparse solver component. Our distributed hash table in UPC++ delivers near-linear weak scaling up to 34816 cores of a Cray XC40. Our UPC++ implementation of the sparse solver component shows robust strong scaling up to 2048 cores, where it outperforms variants communicating using MPI by up to 3.1x.
UPC++ encourages the use of aggressive asynchrony in low-overhead RMA and RPC, improving programmer productivity and delivering high performance in irregular applications.
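
To make the distributed hash table motif above concrete, the following is a minimal illustrative sketch in the same spirit, built on UPC++ distributed objects and RPC. It is not code from the paper; the shard layout, the owner_of mapping, and all identifiers are assumptions, while dist_object, rpc, barrier, and futures are standard UPC++ facilities.

    // Toy distributed hash table: each rank owns one shard, and inserts and
    // lookups are RPCs executed on the owning rank (illustrative only).
    #include <upcxx/upcxx.hpp>
    #include <string>
    #include <unordered_map>

    using map_t  = std::unordered_map<std::string, std::string>;
    using dmap_t = upcxx::dist_object<map_t>;

    // Hypothetical key-to-owner mapping: hash the key onto a rank.
    static int owner_of(const std::string &key) {
      return int(std::hash<std::string>{}(key) % upcxx::rank_n());
    }

    int main() {
      upcxx::init();
      dmap_t shard(map_t{});  // this rank's shard of the table

      // Insert: run an RPC on the owning rank; the returned future completes
      // once the remote update has been applied.
      std::string key = "key-" + std::to_string(upcxx::rank_me());
      upcxx::rpc(owner_of(key),
                 [](dmap_t &s, const std::string &k, const std::string &v) {
                   (*s)[k] = v;
                 },
                 shard, key, std::string("value")).wait();

      upcxx::barrier();

      // Lookup: fetch the value back from whichever rank owns the key.
      std::string got = upcxx::rpc(owner_of(key),
                                   [](dmap_t &s, const std::string &k) {
                                     return (*s)[k];
                                   },
                                   shard, key).wait();

      upcxx::barrier();
      upcxx::finalize();
      return got == "value" ? 0 : 1;
    }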

Scott B. Baden, Paul H. Hargrove, Hadia Ahmed, John Bachan, Dan Bonachea, Steve Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen, "Pagoda: Lightweight Communications and Global Address Space Support for Exascale Applications - UPC++", Poster at Exascale Computing Project (ECP) Annual Meeting 2019, January 2019,

Ann S. Almgren

2019

D. Fan, A. Nonaka, A.S. Almgren, A. Harpole, M. Zingale, "MAESTROeX: A Massively Parallel Low Mach Number Astrophysical Solver", August 14, 2019,

Knut Sverdrup, Ann S. Almgren, Nikolaos Nikiforakis, "An embedded boundary approach for efficient simulations of viscoplastic fluids in three dimensions", August 10, 2019,

Weiqun Zhang, Ann Almgren, Vince Beckner, John Bell, Johannes Blaschke, Cy Chan, Marcus Day, Brian Friesen, Kevin Gott, Daniel Graves, Max P. Katz, Andrew Myers, Tan Nguyen, Andrew Nonaka, Michele Rosso, Samuel Williams, Michael Zingale, "AMReX: a framework for block-structured adaptive mesh refinement", Journal of Open Source Software, May 2019, doi: 10.21105/joss.01370

John Bachan

2019

John Bachan, Scott Baden, Dan Bonachea, Paul Hargrove, Steven Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen, "UPC++ v1.0 Specification, Revision 2019.9.0", Lawrence Berkeley National Laboratory Tech Report, September 14, 2019, LBNL 2001237, doi: 10.25344/S4ZW2C

UPC++ is a C++11 library providing classes and functions that support Partitioned Global Address Space (PGAS) programming. We are revising the library under the auspices of the DOE’s Exascale Computing Project, to meet the needs of applications requiring PGAS support. UPC++ is intended for implementing elaborate distributed data structures where communication is irregular or fine-grained. The UPC++ interfaces for moving non-contiguous data and handling memories with different optimal access methods are composable and similar to those used in conventional C++. The UPC++ programmer can expect communication to run at close to hardware speeds. The key facilities in UPC++ are global pointers, that enable the programmer to express ownership information for improving locality, one-sided communication, both put/get and RPC, futures and continuations. Futures capture data readiness state, which is useful in making scheduling decisions, and continuations provide for completion handling via callbacks. Together, these enable the programmer to chain together a DAG of operations to execute asynchronously as high-latency dependencies become satisfied.
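
As a concrete illustration of the futures and continuations mentioned in this abstract, the hedged sketch below chains a remote get with two local continuations into a small asynchronous DAG. It is not taken from the specification; the allocation pattern and variable names are assumptions, while global_ptr, new_, broadcast, rget, then, and wait are facilities the specification defines.

    // Chaining asynchronous operations with futures and continuations
    // (illustrative sketch only).
    #include <upcxx/upcxx.hpp>
    #include <iostream>

    int main() {
      upcxx::init();

      // Rank 0 allocates an integer in its shared segment; the global pointer
      // is broadcast so every rank can address that remote location.
      upcxx::global_ptr<int> gp;
      if (upcxx::rank_me() == 0) gp = upcxx::new_<int>(42);
      gp = upcxx::broadcast(gp, 0).wait();

      // A small DAG: fetch the remote value, then double it, then report it.
      upcxx::future<> done =
          upcxx::rget(gp)
              .then([](int v) { return 2 * v; })   // continuation #1
              .then([](int v) {                    // continuation #2
                std::cout << "rank " << upcxx::rank_me()
                          << " computed " << v << std::endl;
              });
      done.wait();  // block until the whole chain has completed

      upcxx::barrier();
      if (upcxx::rank_me() == 0) upcxx::delete_(gp);
      upcxx::finalize();
    }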

John Bachan, Scott Baden, Dan Bonachea, Paul Hargrove, Steven Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen, "UPC++ v1.0 Programmer’s Guide, Revision 2019.9.0", Lawrence Berkeley National Laboratory Tech Report, September 14, 2019, LBNL 2001236, doi: 10.25344/S4V30R

UPC++ is a C++11 library that provides Partitioned Global Address Space (PGAS) programming. It is designed for writing parallel programs that run efficiently and scale well on distributed-memory parallel computers. The PGAS model is single program, multiple-data (SPMD), with each separate constituent process having access to local memory as it would in C++. However, PGAS also provides access to a global address space, which is allocated in shared segments that are distributed over the processes. UPC++ provides numerous methods for accessing and using global memory. In UPC++, all operations that access remote memory are explicit, which encourages programmers to be aware of the cost of communication and data movement. Moreover, all remote-memory access operations are by default asynchronous, to enable programmers to write code that scales well even on hundreds of thousands of cores.
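
The explicit, asynchronous remote-memory access described in this abstract looks roughly like the sketch below, in which every rank performs a non-blocking rput into its right neighbor's shared segment. This is an illustrative example rather than material from the guide; the ring pattern and names are assumptions, but new_, broadcast, rput, futures, and barrier are standard UPC++ calls.

    // Explicit one-sided, asynchronous remote memory access (illustrative).
    #include <upcxx/upcxx.hpp>
    #include <vector>

    int main() {
      upcxx::init();
      const int me = upcxx::rank_me(), n = upcxx::rank_n();

      // Each rank allocates one slot in the global address space and gathers
      // everyone's global pointer so remote slots can be addressed.
      upcxx::global_ptr<int> my_slot = upcxx::new_<int>(-1);
      std::vector<upcxx::global_ptr<int>> slots(n);
      for (int r = 0; r < n; ++r)
        slots[r] = upcxx::broadcast(my_slot, r).wait();

      // Explicit put into the right neighbor's memory; asynchronous by
      // default, so we keep the returned future and wait for completion.
      upcxx::future<> f = upcxx::rput(me, slots[(me + 1) % n]);
      f.wait();

      upcxx::barrier();                 // all puts have landed
      int left = *my_slot.local();      // local() is valid: we own this slot
      upcxx::barrier();

      upcxx::delete_(my_slot);
      upcxx::finalize();
      return left == (me + n - 1) % n ? 0 : 1;
    }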

John Bachan, Scott B. Baden, Steven Hofmeyr, Mathias Jacquelin, Amir Kamil, Dan Bonachea, Paul H. Hargrove, Hadia Ahmed, "UPC++: A High-Performance Communication Framework for Asynchronous Computation", 33rd IEEE International Parallel & Distributed Processing Symposium (IPDPS'19), Rio de Janeiro, Brazil, IEEE, May 2019, doi: 10.25344/S4V88H

UPC++ is a C++ library that supports high-performance computation via an asynchronous communication framework. This paper describes a new incarnation that differs substantially from its predecessor, and we discuss the reasons for our design decisions. We present new design features, including future-based asynchrony management, distributed objects, and generalized Remote Procedure Call (RPC).
We show microbenchmark performance results demonstrating that one-sided Remote Memory Access (RMA) in UPC++ is competitive with MPI-3 RMA; on a Cray XC40 UPC++ delivers up to a 25% improvement in the latency of blocking RMA put, and up to a 33% bandwidth improvement in an RMA throughput test. We showcase the benefits of UPC++ with irregular applications through a pair of application motifs, a distributed hash table and a sparse solver component. Our distributed hash table in UPC++ delivers near-linear weak scaling up to 34816 cores of a Cray XC40. Our UPC++ implementation of the sparse solver component shows robust strong scaling up to 2048 cores, where it outperforms variants communicating using MPI by up to 3.1x.
UPC++ encourages the use of aggressive asynchrony in low-overhead RMA and RPC, improving programmer productivity and delivering high performance in irregular applications.

John Bachan, Scott Baden, Dan Bonachea, Paul Hargrove, Steven Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen, "UPC++ Programmer's Guide, v1.0-2019.3.0", Lawrence Berkeley National Laboratory Tech Report, March 15, 2019, LBNL 2001191, doi: 10.25344/S4F301

UPC++ is a C++11 library that provides Partitioned Global Address Space (PGAS) programming. It is designed for writing parallel programs that run efficiently and scale well on distributed-memory parallel computers. The PGAS model is single program, multiple-data (SPMD), with each separate constituent process having access to local memory as it would in C++. However, PGAS also provides access to a global address space, which is allocated in shared segments that are distributed over the processes. UPC++ provides numerous methods for accessing and using global memory. In UPC++, all operations that access remote memory are explicit, which encourages programmers to be aware of the cost of communication and data movement. Moreover, all remote-memory access operations are by default asynchronous, to enable programmers to write code that scales well even on hundreds of thousands of cores.

John Bachan, Scott Baden, Dan Bonachea, Paul Hargrove, Steven Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen, "UPC++ Specification v1.0, Draft 10", Lawrence Berkeley National Laboratory Tech Report, March 15, 2019, LBNL 2001192, doi: 10.25344/S4JS30

UPC++ is a C++11 library providing classes and functions that support Partitioned Global Address Space (PGAS) programming. We are revising the library under the auspices of the DOE’s Exascale Computing Project, to meet the needs of applications requiring PGAS support. UPC++ is intended for implementing elaborate distributed data structures where communication is irregular or fine-grained. The UPC++ interfaces for moving non-contiguous data and handling memories with different optimal access methods are composable and similar to those used in conventional C++. The UPC++ programmer can expect communication to run at close to hardware speeds. The key facilities in UPC++ are global pointers, that enable the programmer to express ownership information for improving locality, one-sided communication, both put/get and RPC, futures and continuations. Futures capture data readiness state, which is useful in making scheduling decisions, and continuations provide for completion handling via callbacks. Together, these enable the programmer to chain together a DAG of operations to execute asynchronously as high-latency dependencies become satisfied.

Scott B. Baden, Paul H. Hargrove, Hadia Ahmed, John Bachan, Dan Bonachea, Steve Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen, "Pagoda: Lightweight Communications and Global Address Space Support for Exascale Applications - UPC++", Poster at Exascale Computing Project (ECP) Annual Meeting 2019, January 2019,

Scott Baden

2019

John Bachan, Scott Baden, Dan Bonachea, Paul Hargrove, Steven Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen, "UPC++ v1.0 Specification, Revision 2019.9.0", Lawrence Berkeley National Laboratory Tech Report, September 14, 2019, LBNL 2001237, doi: 10.25344/S4ZW2C

UPC++ is a C++11 library providing classes and functions that support Partitioned Global Address Space (PGAS) programming. We are revising the library under the auspices of the DOE’s Exascale Computing Project, to meet the needs of applications requiring PGAS support. UPC++ is intended for implementing elaborate distributed data structures where communication is irregular or fine-grained. The UPC++ interfaces for moving non-contiguous data and handling memories with different optimal access methods are composable and similar to those used in conventional C++. The UPC++ programmer can expect communication to run at close to hardware speeds. The key facilities in UPC++ are global pointers, that enable the programmer to express ownership information for improving locality, one-sided communication, both put/get and RPC, futures and continuations. Futures capture data readiness state, which is useful in making scheduling decisions, and continuations provide for completion handling via callbacks. Together, these enable the programmer to chain together a DAG of operations to execute asynchronously as high-latency dependencies become satisfied.

John Bachan, Scott Baden, Dan Bonachea, Paul Hargrove, Steven Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen, "UPC++ v1.0 Programmer’s Guide, Revision 2019.9.0", Lawrence Berkeley National Laboratory Tech Report, September 14, 2019, LBNL 2001236, doi: 10.25344/S4V30R

UPC++ is a C++11 library that provides Partitioned Global Address Space (PGAS) programming. It is designed for writing parallel programs that run efficiently and scale well on distributed-memory parallel computers. The PGAS model is single program, multiple-data (SPMD), with each separate constituent process having access to local memory as it would in C++. However, PGAS also provides access to a global address space, which is allocated in shared segments that are distributed over the processes. UPC++ provides numerous methods for accessing and using global memory. In UPC++, all operations that access remote memory are explicit, which encourages programmers to be aware of the cost of communication and data movement. Moreover, all remote-memory access operations are by default asynchronous, to enable programmers to write code that scales well even on hundreds of thousands of cores.

John Bachan, Scott B. Baden, Steven Hofmeyr, Mathias Jacquelin, Amir Kamil, Dan Bonachea, Paul H. Hargrove, Hadia Ahmed, "UPC++: A High-Performance Communication Framework for Asynchronous Computation", 33rd IEEE International Parallel & Distributed Processing Symposium (IPDPS'19), Rio de Janeiro, Brazil, IEEE, May 2019, doi: 10.25344/S4V88H

UPC++ is a C++ library that supports high-performance computation via an asynchronous communication framework. This paper describes a new incarnation that differs substantially from its predecessor, and we discuss the reasons for our design decisions. We present new design features, including future-based asynchrony management, distributed objects, and generalized Remote Procedure Call (RPC).
We show microbenchmark performance results demonstrating that one-sided Remote Memory Access (RMA) in UPC++ is competitive with MPI-3 RMA; on a Cray XC40 UPC++ delivers up to a 25% improvement in the latency of blocking RMA put, and up to a 33% bandwidth improvement in an RMA throughput test. We showcase the benefits of UPC++ with irregular applications through a pair of application motifs, a distributed hash table and a sparse solver component. Our distributed hash table in UPC++ delivers near-linear weak scaling up to 34816 cores of a Cray XC40. Our UPC++ implementation of the sparse solver component shows robust strong scaling up to 2048 cores, where it outperforms variants communicating using MPI by up to 3.1x.
UPC++ encourages the use of aggressive asynchrony in low-overhead RMA and RPC, improving programmer productivity and delivering high performance in irregular applications.

John Bachan, Scott Baden, Dan Bonachea, Paul Hargrove, Steven Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen, "UPC++ Programmer's Guide, v1.0-2019.3.0", Lawrence Berkeley National Laboratory Tech Report, March 15, 2019, LBNL 2001191, doi: 10.25344/S4F301

UPC++ is a C++11 library that provides Partitioned Global Address Space (PGAS) programming. It is designed for writing parallel programs that run efficiently and scale well on distributed-memory parallel computers. The PGAS model is single program, multiple-data (SPMD), with each separate constituent process having access to local memory as it would in C++. However, PGAS also provides access to a global address space, which is allocated in shared segments that are distributed over the processes. UPC++ provides numerous methods for accessing and using global memory. In UPC++, all operations that access remote memory are explicit, which encourages programmers to be aware of the cost of communication and data movement. Moreover, all remote-memory access operations are by default asynchronous, to enable programmers to write code that scales well even on hundreds of thousands of cores.

John Bachan, Scott Baden, Dan Bonachea, Paul Hargrove, Steven Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen, "UPC++ Specification v1.0, Draft 10", Lawrence Berkeley National Laboratory Tech Report, March 15, 2019, LBNL 2001192, doi: 10.25344/S4JS30

UPC++ is a C++11 library providing classes and functions that support Partitioned Global Address Space (PGAS) programming. We are revising the library under the auspices of the DOE’s Exascale Computing Project, to meet the needs of applications requiring PGAS support. UPC++ is intended for implementing elaborate distributed data structures where communication is irregular or fine-grained. The UPC++ interfaces for moving non-contiguous data and handling memories with different optimal access methods are composable and similar to those used in conventional C++. The UPC++ programmer can expect communication to run at close to hardware speeds. The key facilities in UPC++ are global pointers, that enable the programmer to express ownership information for improving locality, one-sided communication, both put/get and RPC, futures and continuations. Futures capture data readiness state, which is useful in making scheduling decisions, and continuations provide for completion handling via callbacks. Together, these enable the programmer to chain together a DAG of operations to execute asynchronously as high-latency dependencies become satisfied.

Scott B. Baden, Paul H. Hargrove, Hadia Ahmed, John Bachan, Dan Bonachea, Steve Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen, "Pagoda: Lightweight Communications and Global Address Space Support for Exascale Applications - UPC++", Poster at Exascale Computing Project (ECP) Annual Meeting 2019, January 2019,

Scott B. Baden, Paul H. Hargrove, Dan Bonachea, "Pagoda: Lightweight Communications and Global Address Space Support for Exascale Applications - GASNet-EX", Poster at Exascale Computing Project (ECP) Annual Meeting 2019, January 2019,

Vincent E. Beckner

2019

Weiqun Zhang, Ann Almgren, Vince Beckner, John Bell, Johannes Blaschke, Cy Chan, Marcus Day, Brian Friesen, Kevin Gott, Daniel Graves, Max P. Katz, Andrew Myers, Tan Nguyen, Andrew Nonaka, Michele Rosso, Samuel Williams, Michael Zingale, "AMReX: a framework for block-structured adaptive mesh refinement", Journal of Open Source Software, May 2019, doi: 10.21105/joss.01370

John B. Bell

2019

M. Zingale, M.P. Katz, J.B. Bell, M.L. Minion, A.J. Nonaka, W. Zhang, "Improved Coupling of Hydrodynamics and Nuclear Reactions via Spectral Deferred Corrections", August 14, 2019,

L. Esclapez, V. Ricchiuti, J.B. Bell, M.S. Day, "A spectral deferred correction strategy for low Mach number flows subject to electric fields", August 10, 2019,

D. R. Ladiges, A. J. Nonaka, J. B. Bell, A. L. Garcia, "On the Suppression and Distortion of Non-Equilibrium Fluctuations by Transpiration", August 10, 2019, doi: 10.1063/1.5093922

A. Donev, A. J. Nonaka, C. Kim, A. L. Garcia, J. B. Bell, "Fluctuating hydrodynamics of electrolytes at electroneutral scales", August 10, 2019,

M. Zingale, K. Eiden, Y. Cavecchi, A. Harpole, J. B. Bell, M. Chang, I. Hawke, M. P. Katz, C.M. Malone, A. J. Nonaka, D. E. Willcox, W. Zhang, "Toward resolved simulations of burning fronts in thermonuclear X-ray bursts", Journal of Physics: Conference Series, 2019, 1225,

A. J. Aspden, M. S. Day, J. B. Bell, "Towards the Distributed Burning Regime in Turbulent Premixed Flames", Journal of Fluid Mechanics, 2019, 871:1-21,

J. Bell, M. Day, J. Goodman, R. Grout, M. Morzfeld, "A Bayesian approach to calibrating hydrogen flame kinetics using many experiments and parameters", Combustion and Flame, 2019,

Weiqun Zhang, Ann Almgren, Vince Beckner, John Bell, Johannes Blaschke, Cy Chan, Marcus Day, Brian Friesen, Kevin Gott, Daniel Graves, Max P. Katz, Andrew Myers, Tan Nguyen, Andrew Nonaka, Michele Rosso, Samuel Williams, Michael Zingale, "AMReX: a framework for block-structured adaptive mesh refinement", Journal of Open Source Software, May 2019, doi: 10.21105/joss.01370

Aleksandar Donev, Alejandro L. Garcia, Jean-Philippe Péraud, Andrew J. Nonaka, John B. Bell, "Fluctuating Hydrodynamics and Debye-Hückel-Onsager Theory for Electrolytes", Current Opinion in Electrochemistry, 2019, 13:1-10, doi: 10.1016/j.coelec.2018.09.004

M. Emmett, E. Motheau, W. Zhang, M. Minion, J. B. Bell, "A Fourth-Order Adaptive Mesh Refinement Algorithm for the Multicomponent, Reacting Compressible Navier-Stokes Equations", Combustion Theory and Modeling, 2019,

Johannes Blaschke

2019

Weiqun Zhang, Ann Almgren, Vince Beckner, John Bell, Johannes Blaschke, Cy Chan, Marcus Day, Brian Friesen, Kevin Gott, Daniel Graves, Max P. Katz, Andrew Myers, Tan Nguyen, Andrew Nonaka, Michele Rosso, Samuel Williams, Michael Zingale, "AMReX: a framework for block-structured adaptive mesh refinement", Journal of Open Source Software, May 2019, doi: 10.21105/joss.01370

Dan Bonachea

2019

Paul H. Hargrove, Dan Bonachea, "Efficient Active Message RMA in GASNet Using a Target-Side Reassembly Protocol (Extended Abstract)", IEEE/ACM Parallel Applications Workshop, Alternatives To MPI+X (PAW-ATM), Lawrence Berkeley National Laboratory Technical Report, November 17, 2019, LBNL 2001238, doi: 10.25344/S4PC7M

GASNet is a portable, open-source, high-performance communication library designed to efficiently support the networking requirements of PGAS runtime systems and other alternative models on future exascale machines. This paper investigates strategies for efficient implementation of GASNet’s “AM Long” API that couples an RMA (Remote Memory Access) transfer with an Active Message (AM) delivery.
We discuss several network-level protocols for AM Long and propose a new target-side reassembly protocol. We present a microbenchmark evaluation on the Cray XC Aries network hardware. The target-side reassembly protocol on this network improves AM Long end-to-end latency by up to 33%, and the effective bandwidth by up to 49%, while also enabling asynchronous source completion that drastically reduces injection overheads.
The improved AM Long implementation for Aries is available in GASNet-EX release v2019.9.0 and later.
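
The target-side reassembly idea can be pictured with the toy, single-process sketch below: the payload of an "AM Long" arrives as independent fragments, the target copies each into its offset in the destination buffer, and the Active Message handler fires exactly once after the final byte lands. This is not the GASNet-EX API or implementation; every type and name here is invented for illustration.

    // Toy model of target-side reassembly for an RMA payload coupled with an
    // Active Message handler (illustrative only; not GASNet-EX code).
    #include <cstdint>
    #include <cstring>
    #include <functional>
    #include <vector>

    struct Packet {                  // one fragment of the long payload
      size_t offset;                 // where it lands in the destination buffer
      std::vector<uint8_t> bytes;
    };

    struct Reassembly {              // target-side state for one AM Long
      std::vector<uint8_t> dest;     // destination RMA buffer
      size_t received = 0;           // bytes landed so far
      std::function<void(const uint8_t *, size_t)> handler;  // AM handler

      void on_packet(const Packet &p) {
        std::memcpy(dest.data() + p.offset, p.bytes.data(), p.bytes.size());
        received += p.bytes.size();
        if (received == dest.size())            // payload complete:
          handler(dest.data(), dest.size());    // run the handler exactly once
      }
    };

    int main() {
      Reassembly r;
      r.dest.resize(8);
      int runs = 0;
      r.handler = [&](const uint8_t *, size_t) { ++runs; };

      // Fragments may arrive in any order; the handler fires after the last.
      r.on_packet({4, {4, 5, 6, 7}});
      r.on_packet({0, {0, 1, 2, 3}});
      return runs == 1 ? 0 : 1;
    }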

John Bachan, Scott Baden, Dan Bonachea, Paul Hargrove, Steven Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen, "UPC++ v1.0 Specification, Revision 2019.9.0", Lawrence Berkeley National Laboratory Tech Report, September 14, 2019, LBNL 2001237, doi: 10.25344/S4ZW2C

UPC++ is a C++11 library providing classes and functions that support Partitioned Global Address Space (PGAS) programming. We are revising the library under the auspices of the DOE’s Exascale Computing Project, to meet the needs of applications requiring PGAS support. UPC++ is intended for implementing elaborate distributed data structures where communication is irregular or fine-grained. The UPC++ interfaces for moving non-contiguous data and handling memories with different optimal access methods are composable and similar to those used in conventional C++. The UPC++ programmer can expect communication to run at close to hardware speeds. The key facilities in UPC++ are global pointers, that enable the programmer to express ownership information for improving locality, one-sided communication, both put/get and RPC, futures and continuations. Futures capture data readiness state, which is useful in making scheduling decisions, and continuations provide for completion handling via callbacks. Together, these enable the programmer to chain together a DAG of operations to execute asynchronously as high-latency dependencies become satisfied.

John Bachan, Scott Baden, Dan Bonachea, Paul Hargrove, Steven Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen, "UPC++ v1.0 Programmer’s Guide, Revision 2019.9.0", Lawrence Berkeley National Laboratory Tech Report, September 14, 2019, LBNL 2001236, doi: 10.25344/S4V30R

UPC++ is a C++11 library that provides Partitioned Global Address Space (PGAS) programming. It is designed for writing parallel programs that run efficiently and scale well on distributed-memory parallel computers. The PGAS model is single program, multiple-data (SPMD), with each separate constituent process having access to local memory as it would in C++. However, PGAS also provides access to a global address space, which is allocated in shared segments that are distributed over the processes. UPC++ provides numerous methods for accessing and using global memory. In UPC++, all operations that access remote memory are explicit, which encourages programmers to be aware of the cost of communication and data movement. Moreover, all remote-memory access operations are by default asynchronous, to enable programmers to write code that scales well even on hundreds of thousands of cores.

John Bachan, Scott B. Baden, Steven Hofmeyr, Mathias Jacquelin, Amir Kamil, Dan Bonachea, Paul H. Hargrove, Hadia Ahmed, "UPC++: A High-Performance Communication Framework for Asynchronous Computation", 33rd IEEE International Parallel & Distributed Processing Symposium (IPDPS'19), Rio de Janeiro, Brazil, IEEE, May 2019, doi: 10.25344/S4V88H

UPC++ is a C++ library that supports high-performance computation via an asynchronous communication framework. This paper describes a new incarnation that differs substantially from its predecessor, and we discuss the reasons for our design decisions. We present new design features, including future-based asynchrony management, distributed objects, and generalized Remote Procedure Call (RPC).
We show microbenchmark performance results demonstrating that one-sided Remote Memory Access (RMA) in UPC++ is competitive with MPI-3 RMA; on a Cray XC40 UPC++ delivers up to a 25% improvement in the latency of blocking RMA put, and up to a 33% bandwidth improvement in an RMA throughput test. We showcase the benefits of UPC++ with irregular applications through a pair of application motifs, a distributed hash table and a sparse solver component. Our distributed hash table in UPC++ delivers near-linear weak scaling up to 34816 cores of a Cray XC40. Our UPC++ implementation of the sparse solver component shows robust strong scaling up to 2048 cores, where it outperforms variants communicating using MPI by up to 3.1x.
UPC++ encourages the use of aggressive asynchrony in low-overhead RMA and RPC, improving programmer productivity and delivering high performance in irregular applications.

John Bachan, Scott Baden, Dan Bonachea, Paul Hargrove, Steven Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen, "UPC++ Programmer's Guide, v1.0-2019.3.0", Lawrence Berkeley National Laboratory Tech Report, March 15, 2019, LBNL 2001191, doi: 10.25344/S4F301

UPC++ is a C++11 library that provides Partitioned Global Address Space (PGAS) programming. It is designed for writing parallel programs that run efficiently and scale well on distributed-memory parallel computers. The PGAS model is single program, multiple-data (SPMD), with each separate constituent process having access to local memory as it would in C++. However, PGAS also provides access to a global address space, which is allocated in shared segments that are distributed over the processes. UPC++ provides numerous methods for accessing and using global memory. In UPC++, all operations that access remote memory are explicit, which encourages programmers to be aware of the cost of communication and data movement. Moreover, all remote-memory access operations are by default asynchronous, to enable programmers to write code that scales well even on hundreds of thousands of cores.

John Bachan, Scott Baden, Dan Bonachea, Paul Hargrove, Steven Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen, "UPC++ Specification v1.0, Draft 10", Lawrence Berkeley National Laboratory Tech Report, March 15, 2019, LBNL 2001192, doi: 10.25344/S4JS30

UPC++ is a C++11 library providing classes and functions that support Partitioned Global Address Space (PGAS) programming. We are revising the library under the auspices of the DOE’s Exascale Computing Project, to meet the needs of applications requiring PGAS support. UPC++ is intended for implementing elaborate distributed data structures where communication is irregular or fine-grained. The UPC++ interfaces for moving non-contiguous data and handling memories with different optimal access methods are composable and similar to those used in conventional C++. The UPC++ programmer can expect communication to run at close to hardware speeds. The key facilities in UPC++ are global pointers, that enable the programmer to express ownership information for improving locality, one-sided communication, both put/get and RPC, futures and continuations. Futures capture data readiness state, which is useful in making scheduling decisions, and continuations provide for completion handling via callbacks. Together, these enable the programmer to chain together a DAG of operations to execute asynchronously as high-latency dependencies become satisfied.

Scott B. Baden, Paul H. Hargrove, Hadia Ahmed, John Bachan, Dan Bonachea, Steve Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen, "Pagoda: Lightweight Communications and Global Address Space Support for Exascale Applications - UPC++", Poster at Exascale Computing Project (ECP) Annual Meeting 2019, January 2019,

Scott B. Baden, Paul H. Hargrove, Dan Bonachea, "Pagoda: Lightweight Communications and Global Address Space Support for Exascale Applications - GASNet-EX", Poster at Exascale Computing Project (ECP) Annual Meeting 2019, January 2019,

Kristofer Bouchard

2019

Oliver Rübel, Andrew Tritt, Benjamin Dichter, Thomas Braun, Nicholas Cain, Nathan Clack, Thomas J. Davidson, Max Dougherty, Jean-Christophe Fillion-Robin, Nile Graddis, Michael Grauer, Justin T. Kiggins, Lawrence Niu, Doruk Ozturk, William Schroeder, Ivan Soltesz, Friedrich T. Sommer, Karel Svoboda, Lydia Ng, Loren M. Frank, Kristofer Bouchard, "NWB:N 2.0: An Accessible Data Standard for Neurophysiology", bioRxiv, January 17, 2019, doi: 10.1101/523035

Joshua Boverhof

2019

Reinhard Gentz, Sean Peisert, Joshua Boverhof, Daniel Gunter, "SPARCS: Stream-Processing Architecture applied in Real-time Cyber-physical Security", Proceedings of the 15th IEEE International Conference on e-Science (eScience), San Diego, CA, IEEE, September 2019,

Aydin Buluç

2019

Marquita Ellis, Giulia Guidi, Aydın Buluç, Leonid Oliker, Katherine Yelick, "diBELLA: Distributed Long Read to Long Read Alignment", 48th International Conference on Parallel Processing (ICPP), June 25, 2019,

Anastasiia Butko

2019

Anastasiia Butko, George Michelogiannakis, David Donofrio, John Shalf, "Extending classical processors to support future large scale quantum accelerators", Proceedings of the 16th ACM International Conference on Computing Frontiers, April 2019,

Anastasiia Butko, George Michelogiannakis, David Donofrio, John Shalf, "TIGER: topology-aware task assignment approach using Ising machines", Proceedings of the 16th ACM International Conference on Computing Frontiers, April 2019,

Surendra Byna

2019

Richard Warren, Jerome Soumagne, Jingqing Mu, Houjun Tang, Suren Byna, Bin Dong, Quincey Koziol, "Analysis in the Data Path of an Object-centric Data Management System", 26th IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC 2019), December 18, 2019,

Houjun Tang, Suren Byna, Stephen Bailey, Zarija Lukic, Jialin Liu, Quincey Koziol, Bin Dong, "Tuning Object-centric Data Management Systems for Large Scale Scientific Applications", 26th IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC 2019), December 18, 2019,

Wei Zhang, Suren Byna, Chenxu Niu, Yong Chen, "Exploring Metadata Search Essentials for Scientific Data Management", 26th IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC 2019), December 17, 2019,

Tirthak Patel, Suren Byna, Glenn K. Lockwood, Devesh Tiwari, "Revisiting I/O Behavior in Large-Scale Storage Systems: The Expected and the Unexpected", Supercomputing 2019 (SC19), November 24, 2019, doi: 10.1145/3295500.3356183

Donghe Kang, Oliver Rübel, Suren Byna, Spyros Blanas, "Comparison of Array Management Library Performance - A Neuroscience Use Case", SC19 Poster, November 20, 2019,

Wei Zhang, Suren Byna, Houjun Tang, Brody Williams, Yong Chen, "MIQS: Metadata Indexing and Querying Service for Self-Describing File Formats", The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC19), November 19, 2019,

Houjun Tang, Quincey Koziol, Suren Byna, John Mainzer, Tonglin Li, "Enabling Transparent Asynchronous I/O using Background Threads", 2019 IEEE/ACM Fourth International Parallel Data Systems Workshop (PDSW 2019), November 19, 2019, doi: 10.1109/PDSW49588.2019.00006

Megha Agarwal, Divyansh Singhvi, Preeti Malakar, Suren Byna, "Active Learning-based Automatic Tuning and Prediction of Parallel I/O Performance", 2019 IEEE/ACM Fourth International Parallel Data Systems Workshop (PDSW), November 19, 2019, doi: 10.1109/PDSW49588.2019.00007

Glenn K. Lockwood, Shane Snyder, Suren Byna, Philip Carns, Nicholas J. Wright, "Understanding Data Motion in the Modern HPC Data Center", 2019 IEEE/ACM Fourth International Parallel Data Systems Workshop (PDSW), November 19, 2019, doi: 10.1109/PDSW49588.2019.00012

Bin Dong, Patrick Frank Heiner Kilian, Xiaocan Li, Fan Guo, Suren Byna and Kesheng Wu, "Terabyte-scale Particle Data Analysis: An ArrayUDF Case Study", SSDBM 2019, July 23, 2019,

Bin Dong, Kesheng Wu, Suren Byna, Houjun Tang, "SLOPE: Structural Locality-aware Programming Model for Composing Array Data Analysis", ISC 2019 (acceptance rate: 24%), June 16, 2019,

S. Kim, A. Sim, K. Wu, S. Byna, T. Wang, Y. Son, H. Eom, "DCA-IO: A Dynamic I/O Control Scheme for Parallel and Distributed File System", 19th Annual IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing (CCGrid 2019), 2019, doi: 10.1109/CCGRID.2019.00049

Teng Wang, Suren Byna, Glenn Lockwood, Philip Carns, Shane Snyder, Sunggon Kim, Nicholas Wright, "A Zoom-in Analysis of I/O Logs to Detect Root Causes of I/O Performance Bottlenecks", IEEE/ACM CCGrid 2019, May 14, 2019,

Tonglin Li, Quincey Koziol, Houjun Tang, Jialin Liu, Suren Byna, "I/O Performance Analysis of Science Applications Using HDF5 File-level Provenance", Cray User Group (CUG) 2019, May 10, 2019,

Jingqing Mu, Jerome Soumagne, Suren Byna, Quincey Koziol, Houjun Tang, Richard Warren, "Interfacing HDF5 with A Scalable Object-centric Storage System on Hierarchical Storage", Cray User Group (CUG) 2019, May 7, 2019,

Babak Behzad, Suren Byna, Prabhat, and Marc Snir, "Optimizing I/O Performance of HPC Applications with Autotuning", ACM Transactions on Parallel Computing (TOPC), February 28, 2019,

Beytullah Yildiz, Kesheng Wu, Suren Byna, Arie Shoshani, "Parallel membership queries on very large scientific data sets using bitmap indexes", Concurrency and Computation: Practice and Experience, January 28, 2019, 31, doi: 10.1002/cpe.5157

Many scientific applications produce very large amounts of data as advances in hardware fuel computing and experimental facilities. Managing and analyzing massive quantities of scientific data is challenging as data are often stored in specific formatted files, such as HDF5 and NetCDF, which do not offer appropriate search capabilities. In this research, we investigated a special class of search capability, called membership query, to identify whether queried elements of a set are members of an attribute. Attributes that naturally have classification values appear frequently in scientific domains such as category and object type as well as in daily life such as zip code and occupation. Because classification attribute values are discrete and require random data access, performing a membership query on a large scientific data set creates challenges. We applied bitmap indexing and parallelization to membership queries to overcome these challenges. Bitmap indexing provides high performance not only for low cardinality attributes but also for high cardinality attributes, such as floating‐point variables, electric charge, or momentum in a particle physics data set, due to compression algorithms such as Word‐Aligned Hybrid. We conducted experiments, in a highly parallelized environment, on data obtained from a particle accelerator model and a synthetic data set.
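
To illustrate the bitmap-index membership query described above, the sketch below builds one (uncompressed) bitmap per distinct attribute value and answers a membership query over several values by OR-ing their bitmaps. It is a simplified illustration, not the authors' code; a production system would store compressed bitmaps (e.g., the Word-Aligned Hybrid scheme mentioned in the abstract) and evaluate partitions of the data in parallel.

    // Bitmap-index membership query (simplified illustration).
    #include <cstdint>
    #include <map>
    #include <vector>

    using Bitmap = std::vector<uint64_t>;   // one bit per record, packed in words

    static void set_bit(Bitmap &b, size_t i) {
      b[i / 64] |= uint64_t(1) << (i % 64);
    }

    int main() {
      // Example categorical attribute, one value per record.
      std::vector<int> attribute = {2, 7, 2, 5, 7, 7, 1, 5};
      const size_t n = attribute.size(), words = (n + 63) / 64;

      // Build the index: one bitmap per distinct attribute value.
      std::map<int, Bitmap> index;
      for (size_t i = 0; i < n; ++i) {
        Bitmap &bm = index.try_emplace(attribute[i], Bitmap(words, 0)).first->second;
        set_bit(bm, i);
      }

      // Membership query: which records have a value in {5, 7}?
      Bitmap result(words, 0);
      for (int v : {5, 7}) {
        auto it = index.find(v);
        if (it == index.end()) continue;
        for (size_t w = 0; w < words; ++w) result[w] |= it->second[w];
      }
      // result now has bits set for records 1, 3, 4, 5, and 7.
      return 0;
    }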

Daan Camps

2019

Daan Camps, Karl Meerbergen, Raf Vandebril, "A rational QZ method", SIAM J. Matrix Anal. Appl., 2019, 40:943--972, doi: 10.1137/18M1170480

Daan Camps, Karl Meerbergen, Raf Vandebril, "An implicit filter for rational Krylov using core transformations", Linear Algebra Appl., 2019, 561:113--140, doi: 10.1016/j.laa.2018.09.021

Daan Camps, "Pole swapping methods for the eigenvalue problem - Rational QR algorithms", 2019,

Daan Camps, Nicola Mastronardi, Raf Vandebril, Paul Van Dooren, "Swapping 2 × 2 blocks in the Schur and generalized Schur form", Journal of Computational and Applied Mathematics, 2019, doi: 10.1016/j.cam.2019.05.022

Andrew Canning

2019

M. Del Ben, F.H. da Jornada, A. Canning, N. Wichmann, K. Raman, R. Sasanka, C. Yang, S.G. Louie, J. Deslippe, "Large-scale GW calculations on pre-exascale HPC systems", Computer Physics Communications, 2019, 235:187-195, doi: 10.1016/j.cpc.2018.09.003

Cy Chan

2019

Weiqun Zhang, Ann Almgren, Vince Beckner, John Bell, Johannes Blaschke, Cy Chan, Marcus Day, Brian Friesen, Kevin Gott, Daniel Graves, Max P. Katz, Andrew Myers, Tan Nguyen, Andrew Nonaka, Michele Rosso, Samuel Williams, Michael Zingale, "AMReX: a framework for block-structured adaptive mesh refinement", Journal of Open Source Software, May 2019, doi: 10.21105/joss.01370

Shreyas Cholia

2019

C. Varadharajan, S. Cholia, C. Snavely, V. Hendrix, C. Procopiou, D. Swantek, W. J. Riley, and D. A. Agarwal, "Launching an accessible archive of environmental data", Eos, 100, January 8, 2019, doi: 10.1029/2019EO111263

Phillip Colella

2019

Boris Lo, Phillip Colella, "An Adaptive Local Discrete Convolution Method for the Numerical Solution of Maxwell's Equations", Communications in Applied Mathematics and Computational Science, April 26, 2019, 14:105-119, doi: 10.2140/camcos.2019.14.105

Marcus S. Day

2019

L. Esclapez, V. Ricchiuti, J.B. Bell, M.S. Day, "A spectral deferred correction strategy for low Mach number flows subject to electric fields", August 10, 2019,

N. T. Wimer, M. S. Day, C. Lapointe, A. S. Makowiecki, J. F. Glusman, J. W. Daily, G. B. Rieker, P. E. Hamlington, "High-resolution numerical simulations of a large-scale helium plume using adaptive mesh refinement", August 10, 2019,

M. T. Henry de Frahan, S. Yellapantula, R. King, M. S. Day, R. W. Grout, "Deep learning for presumed probability density function models", August 10, 2019,

D. Dasgupta, W. Sun, M. Day, A. Aspden, T. Lieuwen, "Analysis of chemical pathways for n-dodecane/air turbulent premixed flames", August 10, 2019,

A. J. Aspden, M. S. Day, J. B. Bell, "Towards the Distributed Burning Regime in Turbulent Premixed Flames", Journal of Fluid Mechanics, 2019, 871:1-21,

J. Bell, M. Day, J. Goodman, R. Grout, M. Morzfeld, "A Bayesian approach to calibrating hydrogen flame kinetics using many experiments and parameters", Combustion and Flame, 2019,

Weiqun Zhang, Ann Almgren, Vince Beckner, John Bell, Johannes Blaschke, Cy Chan, Marcus Day, Brian Friesen, Kevin Gott, Daniel Graves, Max P. Katz, Andrew Myers, Tan Nguyen, Andrew Nonaka, Michele Rosso, Samuel Williams, Michael Zingale, "AMReX: a framework for block-structured adaptive mesh refinement", Journal of Open Source Software, May 2019, doi: 10.21105/joss.01370

J. Muller, M. Day, "Surrogate Optimization of Computationally Expensive Black-box Problems with Hidden Constraints", INFORMS Journal on Computing, 2019,

Nan Ding

2019

Nan Ding, Samuel Williams, "An Instruction Roofline Model for GPUs", Performance Modeling, Benchmarking, and Simulation (PMBS), BEST PAPER AWARD, November 18, 2019,

Nan Ding, Samuel Williams, Sherry Li, Yang Liu, "Leveraging One-Sided Communication for Sparse Triangular Solvers", SciDAC19, July 18, 2019,

Samuel Williams, Charlene Yang, Khaled Ibrahim, Thorsten Kurth, Nan Ding, Jack Deslippe, Leonid Oliker, "Performance Analysis using the Roofline Model", SciDAC PI Meeting, July 2019,

Bin Dong

2019

Richard Warren, Jerome Soumagne, Jingqing Mu, Houjun Tang, Suren Byna, Bin Dong, Quincey Koziol, "Analysis in the Data Path of an Object-centric Data Management System", 26th IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC 2019), December 18, 2019,

Houjun Tang, Suren Byna, Stephen Bailey, Zarija Lukic, Jialin Liu, Quincey Koziol, Bin Dong, "Tuning Object-centric Data Management Systems for Large Scale Scientific Applications", 26th IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC 2019), December 18, 2019,

Bin Dong, Patrick Frank Heiner Kilian, Xiaocan Li, Fan Guo, Suren Byna and Kesheng Wu, "Terabyte-scale Particle Data Analysis: An ArrayUDF Case Study", SSDBM 2019, July 23, 2019,

Bin Dong, Kesheng Wu, Suren Byna, Houjun Tang, "SLOPE: Structural Locality-aware Programming Model for Composing Array Data Analysis", ISC 2019 (acceptance rate: 24%), June 16, 2019,

David Donofrio

2019

Anastasiia Butko, George Michelogiannakis, David Donofrio, John Shalf, "Extending classical processors to support future large scale quantum accelerators", Proceedings of the 16th ACM International Conference on Computing Frontiers, April 2019,

Anastasiia Butko, George Michelogiannakis, David Donofrio, John Shalf, "TIGER: topology-aware task assignment approach using Ising machines", Proceedings of the 16th ACM International Conference on Computing Frontiers, April 2019,

D. Vasudevan, G. Michelogiannakis, D. Donofrio, J. Shalf, "PARADISE - Post-Moore Architecture and Accelerator Design Space Exploration Using Device Level Simulation and Experiments", 2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), IEEE, January 2019, doi: 10.1109/ispass.2019.00022

Marquita Ellis

2019

Marquita Ellis, Giulia Guidi, Aydın Buluç, Leonid Oliker, Katherine Yelick, "diBELLA: Distributed Long Read to Long Read Alignment", 48th International Conference on Parallel Processing (ICPP), June 25, 2019,

Lucas Esclapez

2019

L. Esclapez, V. Ricchiuti, J.B. Bell, M.S. Day, "A spectral deferred correction strategy for low Mach number flows subject to electric fields", August 10, 2019,

Doreen Fan

2019

D. Fan, A. Nonaka, A.S. Almgren, A. Harpole, M. Zingale, "MAESTROeX: A Massively Parallel Low Mach Number Astrophysical Solver", August 14, 2019,

Reinhard Gentz

2019

Thomas W. Edgar, Aditya Ashok, Garret E. Seppala, K.M. Arthur-Durrett, M. Engels, Reinhard Gentz, Sean Peisert, "An Automated Disruption-Tolerant Key Management Framework for Critical Systems", Journal of Information Warfare, October 8, 2019,

Mahdi Jamei, Raksha Ramakrishna, Teklemariam Tesfay, Reinhard Gentz, Ciaran Roberts, Anna Scaglione, Sean Peisert, "Phasor Measurement Units Optimal Placement and Performance Limits for Fault Localization", IEEE Journal on Selected Areas in Communications (J-SAC), Special Issue on Communications and Data Analytics in Smart Grid, October 2, 2019, doi: 10.1109/jsac.2019.2951971

Reinhard Gentz, Sean Peisert, Joshua Boverhof, Daniel Gunter, "SPARCS: Stream-Processing Architecture applied in Real-time Cyber-physical Security", Proceedings of the 15th IEEE International Conference on e-Science (eScience), San Diego, CA, IEEE, September 2019,

Reinhard Gentz, Héctor García Martin, Edward Baidoo, Sean Peisert, "Workflow Automation in Liquid Chromatography Mass Spectrometry", Proceedings of the 15th IEEE International Conference on e-Science (eScience), San Diego, CA, IEEE, September 2019,

Ciaran Roberts, Anna Scaglione, Mahdi Jamei, Reinhard Gentz, Sean Peisert, Emma M. Stewart, Chuck McParland, Alex McEachern, Daniel Arnold, "Learning Behavior of Distribution System Discrete Control Devices for Cyber-Physical Security", IEEE Transactions on Smart Grid, July 31, 2019, doi: 10.1109/TSG.2019.2936016

Melissa Stockman, Dipankar Dwivedi, Reinhard Gentz, Sean Peisert, "Detecting Programmable Logic Controller Code Using Machine Learning", International Journal of Critical Infrastructure Protection, July 2019, doi: 10.1016/j.ijcip.2019.100306

Devarshi Ghoshal

2019

P. Linton, W. Melodia, A. Lazar, D. Agarwal, L. Bianchi, D. Ghoshal, K. Wu, G. Pastorello, L. Ramakrishnan, "Identifying Time Series Similarity in Large-Scale Earth System Datasets", International Workshop on Big Data Tools, Methods, and Use Cases for Innovative Scientific Discovery (BTSD), in conjunction with the IEEE International Conference on Big Data (Big Data), 2019,

P. Linton, W. Melodia, A. Lazar, D. Agarwal, L. Bianchi, D. Ghoshal, K. Wu, G. Pastorello, L. Ramakrishnan, "Identifying Time Series Similarity in Large-Scale Earth System Datasets", The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC19), 2019,

Pieter Ghysels

2019

Y. Liu, W. Sid-Lakhdar, E. Rebrova, P. Ghysels, X. Sherry Li, "A Hierarchical Low-Rank Decomposition Algorithm Based on Blocked Adaptive Cross Approximation Algorithms", arXiv e-prints, January 1, 2019,

Anna Giannakou

2019

Anna Giannakou, Dipankar Dwivedi, Sean Peisert, "A Machine Learning Approach for Packet Loss Prediction in Science Flows", Future Generation Computer Systems, July 2019, doi: 10.1016/j.future.2019.07.053

Dan Graves

2019

Weiqun Zhang, Ann Almgren, Vince Beckner, John Bell, Johannes Blaschke, Cy Chan, Marcus Day, Brian Friesen, Kevin Gott, Daniel Graves, Max P. Katz, Andrew Myers, Tan Nguyen, Andrew Nonaka, Michele Rosso, Samuel Williams, Michael Zingale, "AMReX: a framework for block-structured adaptive mesh refinement", Journal of Open Source Software, May 2019, doi: 10.21105/joss.01370

Giulia Guidi

2019

Marquita Ellis, Giulia Guidi, Aydın Buluç, Leonid Oliker, Katherine Yelick, "diBELLA: Distributed Long Read to Long Read Alignment", 48th International Conference on Parallel Processing (ICPP), June 25, 2019,

Daniel Gunter

2019

Reinhard Gentz, Sean Peisert, Joshua Boverhof, Daniel Gunter, "SPARCS: Stream-Processing Architecture applied in Real-time Cyber-physical Security", Proceedings of the 15th IEEE International Conference on e-Science (eScience), San Diego, CA, IEEE, September 2019,

Francois Hamon

2019

Francois P. Hamon, Martin Schreiber, Michael L. Minion, "Parallel-in-Time Multi-Level Integration of the Shallow-Water Equations on the Rotating Sphere", April 12, 2019,

Submitted to Journal of Computational Physics

Francois P. Hamon, Martin Schreiber, Michael L. Minion, "Multi-Level Spectral Deferred Corrections Scheme for the Shallow Water Equations on the Rotating Sphere", Journal of Computational Physics, January 1, 2019, 376:435-454,

Paul H. Hargrove

2019

Paul H. Hargrove, Dan Bonachea, "Efficient Active Message RMA in GASNet Using a Target-Side Reassembly Protocol (Extended Abstract)", IEEE/ACM Parallel Applications Workshop, Alternatives To MPI+X (PAW-ATM), Lawrence Berkeley National Laboratory Technical Report, November 17, 2019, LBNL 2001238, doi: 10.25344/S4PC7M

GASNet is a portable, open-source, high-performance communication library designed to efficiently support the networking requirements of PGAS runtime systems and other alternative models on future exascale machines. This paper investigates strategies for efficient implementation of GASNet’s “AM Long” API that couples an RMA (Remote Memory Access) transfer with an Active Message (AM) delivery.
We discuss several network-level protocols for AM Long and propose a new target-side reassembly protocol. We present a microbenchmark evaluation on the Cray XC Aries network hardware. The target-side reassembly protocol on this network improves AM Long end-to-end latency by up to 33%, and the effective bandwidth by up to 49%, while also enabling asynchronous source completion that drastically reduces injection overheads.
The improved AM Long implementation for Aries is available in GASNet-EX release v2019.9.0 and later.

John Bachan, Scott Baden, Dan Bonachea, Paul Hargrove, Steven Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen, "UPC++ v1.0 Specification, Revision 2019.9.0", Lawrence Berkeley National Laboratory Tech Report, September 14, 2019, LBNL 2001237, doi: 10.25344/S4ZW2C

UPC++ is a C++11 library providing classes and functions that support Partitioned Global Address Space (PGAS) programming. We are revising the library under the auspices of the DOE’s Exascale Computing Project, to meet the needs of applications requiring PGAS support. UPC++ is intended for implementing elaborate distributed data structures where communication is irregular or fine-grained. The UPC++ interfaces for moving non-contiguous data and handling memories with different optimal access methods are composable and similar to those used in conventional C++. The UPC++ programmer can expect communication to run at close to hardware speeds. The key facilities in UPC++ are global pointers, that enable the programmer to express ownership information for improving locality, one-sided communication, both put/get and RPC, futures and continuations. Futures capture data readiness state, which is useful in making scheduling decisions, and continuations provide for completion handling via callbacks. Together, these enable the programmer to chain together a DAG of operations to execute asynchronously as high-latency dependencies become satisfied.

John Bachan, Scott Baden, Dan Bonachea, Paul Hargrove, Steven Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen, "UPC++ v1.0 Programmer’s Guide, Revision 2019.9.0", Lawrence Berkeley National Laboratory Tech Report, September 14, 2019, LBNL 2001236, doi: 10.25344/S4V30R

UPC++ is a C++11 library that provides Partitioned Global Address Space (PGAS) programming. It is designed for writing parallel programs that run efficiently and scale well on distributed-memory parallel computers. The PGAS model is single program, multiple-data (SPMD), with each separate constituent process having access to local memory as it would in C++. However, PGAS also provides access to a global address space, which is allocated in shared segments that are distributed over the processes. UPC++ provides numerous methods for accessing and using global memory. In UPC++, all operations that access remote memory are explicit, which encourages programmers to be aware of the cost of communication and data movement. Moreover, all remote-memory access operations are by default asynchronous, to enable programmers to write code that scales well even on hundreds of thousands of cores.

John Bachan, Scott B. Baden, Steven Hofmeyr, Mathias Jacquelin, Amir Kamil, Dan Bonachea, Paul H. Hargrove, Hadia Ahmed, "UPC++: A High-Performance Communication Framework for Asynchronous Computation", 33rd IEEE International Parallel & Distributed Processing Symposium (IPDPS'19), Rio de Janeiro, Brazil, IEEE, May 2019, doi: 10.25344/S4V88H

UPC++ is a C++ library that supports high-performance computation via an asynchronous communication framework. This paper describes a new incarnation that differs substantially from its predecessor, and we discuss the reasons for our design decisions. We present new design features, including future-based asynchrony management, distributed objects, and generalized Remote Procedure Call (RPC).
We show microbenchmark performance results demonstrating that one-sided Remote Memory Access (RMA) in UPC++ is competitive with MPI-3 RMA; on a Cray XC40 UPC++ delivers up to a 25% improvement in the latency of blocking RMA put, and up to a 33% bandwidth improvement in an RMA throughput test. We showcase the benefits of UPC++ with irregular applications through a pair of application motifs, a distributed hash table and a sparse solver component. Our distributed hash table in UPC++ delivers near-linear weak scaling up to 34816 cores of a Cray XC40. Our UPC++ implementation of the sparse solver component shows robust strong scaling up to 2048 cores, where it outperforms variants communicating using MPI by up to 3.1x.
UPC++ encourages the use of aggressive asynchrony in low-overhead RMA and RPC, improving programmer productivity and delivering high performance in irregular applications.

John Bachan, Scott Baden, Dan Bonachea, Paul Hargrove, Steven Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen, "UPC++ Programmer's Guide, v1.0-2019.3.0", Lawrence Berkeley National Laboratory Tech Report, March 15, 2019, LBNL 2001191, doi: 10.25344/S4F301

UPC++ is a C++11 library that provides Partitioned Global Address Space (PGAS) programming. It is designed for writing parallel programs that run efficiently and scale well on distributed-memory parallel computers. The PGAS model is single program, multiple-data (SPMD), with each separate constituent process having access to local memory as it would in C++. However, PGAS also provides access to a global address space, which is allocated in shared segments that are distributed over the processes. UPC++ provides numerous methods for accessing and using global memory. In UPC++, all operations that access remote memory are explicit, which encourages programmers to be aware of the cost of communication and data movement. Moreover, all remote-memory access operations are by default asynchronous, to enable programmers to write code that scales well even on hundreds of thousands of cores.

John Bachan, Scott Baden, Dan Bonachea, Paul Hargrove, Steven Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen, "UPC++ Specification v1.0, Draft 10", Lawrence Berkeley National Laboratory Tech Report, March 15, 2019, LBNL 2001192, doi: 10.25344/S4JS30

UPC++ is a C++11 library providing classes and functions that support Partitioned Global Address Space (PGAS) programming. We are revising the library under the auspices of the DOE’s Exascale Computing Project, to meet the needs of applications requiring PGAS support. UPC++ is intended for implementing elaborate distributed data structures where communication is irregular or fine-grained. The UPC++ interfaces for moving non-contiguous data and handling memories with different optimal access methods are composable and similar to those used in conventional C++. The UPC++ programmer can expect communication to run at close to hardware speeds. The key facilities in UPC++ are global pointers, that enable the programmer to express ownership information for improving locality, one-sided communication, both put/get and RPC, futures and continuations. Futures capture data readiness state, which is useful in making scheduling decisions, and continuations provide for completion handling via callbacks. Together, these enable the programmer to chain together a DAG of operations to execute asynchronously as high-latency dependencies become satisfied.

Scott B. Baden, Paul H. Hargrove, Hadia Ahmed, John Bachan, Dan Bonachea, Steve Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen, "Pagoda: Lightweight Communications and Global Address Space Support for Exascale Applications - UPC++", Poster at Exascale Computing Project (ECP) Annual Meeting 2019, January 2019,

Scott B. Baden, Paul H. Hargrove, Dan Bonachea, "Pagoda: Lightweight Communications and Global Address Space Support for Exascale Applications - GASNet-EX", Poster at Exascale Computing Project (ECP) Annual Meeting 2019, January 2019,

Valerie Hendrix

2019

C. Varadharajan, S. Cholia, C. Snavely, V. Hendrix, C. Procopiou, D. Swantek, W. J. Riley, and D. A. Agarwal, "Launching an accessible archive of environmental data", Eos, 100, January 8, 2019, doi: https://doi.org/10.1029/2019EO111263

Steven Hofmeyr

2019

John Bachan, Scott Baden, Dan Bonachea, Paul Hargrove, Steven Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen, "UPC++ v1.0 Specification, Revision 2019.9.0", Lawrence Berkeley National Laboratory Tech Report, September 14, 2019, LBNL 2001237, doi: 10.25344/S4ZW2C

UPC++ is a C++11 library providing classes and functions that support Partitioned Global Address Space (PGAS) programming. We are revising the library under the auspices of the DOE’s Exascale Computing Project, to meet the needs of applications requiring PGAS support. UPC++ is intended for implementing elaborate distributed data structures where communication is irregular or fine-grained. The UPC++ interfaces for moving non-contiguous data and handling memories with different optimal access methods are composable and similar to those used in conventional C++. The UPC++ programmer can expect communication to run at close to hardware speeds. The key facilities in UPC++ are global pointers, which enable the programmer to express ownership information for improving locality; one-sided communication, both put/get and RPC; and futures and continuations. Futures capture data readiness state, which is useful in making scheduling decisions, and continuations provide for completion handling via callbacks. Together, these enable the programmer to chain together a DAG of operations to execute asynchronously as high-latency dependencies become satisfied.

John Bachan, Scott Baden, Dan Bonachea, Paul Hargrove, Steven Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen, "UPC++ v1.0 Programmer’s Guide, Revision 2019.9.0", Lawrence Berkeley National Laboratory Tech Report, September 14, 2019, LBNL 2001236, doi: 10.25344/S4V30R

UPC++ is a C++11 library that provides Partitioned Global Address Space (PGAS) programming. It is designed for writing parallel programs that run efficiently and scale well on distributed-memory parallel computers. The PGAS model is single program, multiple-data (SPMD), with each separate constituent process having access to local memory as it would in C++. However, PGAS also provides access to a global address space, which is allocated in shared segments that are distributed over the processes. UPC++ provides numerous methods for accessing and using global memory. In UPC++, all operations that access remote memory are explicit, which encourages programmers to be aware of the cost of communication and data movement. Moreover, all remote-memory access operations are by default asynchronous, to enable programmers to write code that scales well even on hundreds of thousands of cores.

John Bachan, Scott B. Baden, Steven Hofmeyr, Mathias Jacquelin, Amir Kamil, Dan Bonachea, Paul H. Hargrove, Hadia Ahmed, "UPC++: A High-Performance Communication Framework for Asynchronous Computation", 33rd IEEE International Parallel & Distributed Processing Symposium (IPDPS'19), Rio de Janeiro, Brazil, IEEE, May 2019, doi: 10.25344/S4V88H

UPC++ is a C++ library that supports high-performance computation via an asynchronous communication framework. This paper describes a new incarnation that differs substantially from its predecessor, and we discuss the reasons for our design decisions. We present new design features, including future-based asynchrony management, distributed objects, and generalized Remote Procedure Call (RPC).
We show microbenchmark performance results demonstrating that one-sided Remote Memory Access (RMA) in UPC++ is competitive with MPI-3 RMA; on a Cray XC40 UPC++ delivers up to a 25% improvement in the latency of blocking RMA put, and up to a 33% bandwidth improvement in an RMA throughput test. We showcase the benefits of UPC++ with irregular applications through a pair of application motifs, a distributed hash table and a sparse solver component. Our distributed hash table in UPC++ delivers near-linear weak scaling up to 34816 cores of a Cray XC40. Our UPC++ implementation of the sparse solver component shows robust strong scaling up to 2048 cores, where it outperforms variants communicating using MPI by up to 3.1x.
UPC++ encourages the use of aggressive asynchrony in low-overhead RMA and RPC, improving programmer productivity and delivering high performance in irregular applications.

John Bachan, Scott Baden, Dan Bonachea, Paul Hargrove, Steven Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen, "UPC++ Programmer's Guide, v1.0-2019.3.0", Lawrence Berkeley National Laboratory Tech Report, March 15, 2019, LBNL 2001191, doi: 10.25344/S4F301

UPC++ is a C++11 library that provides Partitioned Global Address Space (PGAS) programming. It is designed for writing parallel programs that run efficiently and scale well on distributed-memory parallel computers. The PGAS model is single program, multiple-data (SPMD), with each separate constituent process having access to local memory as it would in C++. However, PGAS also provides access to a global address space, which is allocated in shared segments that are distributed over the processes. UPC++ provides numerous methods for accessing and using global memory. In UPC++, all operations that access remote memory are explicit, which encourages programmers to be aware of the cost of communication and data movement. Moreover, all remote-memory access operations are by default asynchronous, to enable programmers to write code that scales well even on hundreds of thousands of cores.

John Bachan, Scott Baden, Dan Bonachea, Paul Hargrove, Steven Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen, "UPC++ Specification v1.0, Draft 10", Lawrence Berkeley National Laboratory Tech Report, March 15, 2019, LBNL 2001192, doi: 10.25344/S4JS30

UPC++ is a C++11 library providing classes and functions that support Partitioned Global Address Space (PGAS) programming. We are revising the library under the auspices of the DOE’s Exascale Computing Project, to meet the needs of applications requiring PGAS support. UPC++ is intended for implementing elaborate distributed data structures where communication is irregular or fine-grained. The UPC++ interfaces for moving non-contiguous data and handling memories with different optimal access methods are composable and similar to those used in conventional C++. The UPC++ programmer can expect communication to run at close to hardware speeds. The key facilities in UPC++ are global pointers, which enable the programmer to express ownership information for improving locality; one-sided communication, both put/get and RPC; and futures and continuations. Futures capture data readiness state, which is useful in making scheduling decisions, and continuations provide for completion handling via callbacks. Together, these enable the programmer to chain together a DAG of operations to execute asynchronously as high-latency dependencies become satisfied.

Scott B. Baden, Paul H. Hargrove, Hadia Ahmed, John Bachan, Dan Bonachea, Steve Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen, "Pagoda: Lightweight Communications and Global Address Space Support for Exascale Applications - UPC++", Poster at Exascale Computing Project (ECP) Annual Meeting 2019, January 2019,

Khaled Ibrahim

2019

Khaled Ibrahim, Samuel Williams, Leonid Oliker, "Performance Analysis of GPU Programming Models using the Roofline Scaling Trajectories", International Symposium on Benchmarking, Measuring and Optimizing (Bench), BEST PAPER AWARD, November 2019,

Samuel Williams, Charlene Yang, Khaled Ibrahim, Thorsten Kurth, Nan Ding, Jack Deslippe, Leonid Oliker, "Performance Analysis using the Roofline Model", SciDAC PI Meeting, July 2019,

Mathias Jacquelin

2019

John Bachan, Scott Baden, Dan Bonachea, Paul Hargrove, Steven Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen, "UPC++ v1.0 Specification, Revision 2019.9.0", Lawrence Berkeley National Laboratory Tech Report, September 14, 2019, LBNL 2001237, doi: 10.25344/S4ZW2C

UPC++ is a C++11 library providing classes and functions that support Partitioned Global Address Space (PGAS) programming. We are revising the library under the auspices of the DOE’s Exascale Computing Project, to meet the needs of applications requiring PGAS support. UPC++ is intended for implementing elaborate distributed data structures where communication is irregular or fine-grained. The UPC++ interfaces for moving non-contiguous data and handling memories with different optimal access methods are composable and similar to those used in conventional C++. The UPC++ programmer can expect communication to run at close to hardware speeds. The key facilities in UPC++ are global pointers, which enable the programmer to express ownership information for improving locality; one-sided communication, both put/get and RPC; and futures and continuations. Futures capture data readiness state, which is useful in making scheduling decisions, and continuations provide for completion handling via callbacks. Together, these enable the programmer to chain together a DAG of operations to execute asynchronously as high-latency dependencies become satisfied.

John Bachan, Scott Baden, Dan Bonachea, Paul Hargrove, Steven Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen, "UPC++ v1.0 Programmer’s Guide, Revision 2019.9.0", Lawrence Berkeley National Laboratory Tech Report, September 14, 2019, LBNL 2001236, doi: 10.25344/S4V30R

UPC++ is a C++11 library that provides Partitioned Global Address Space (PGAS) programming. It is designed for writing parallel programs that run efficiently and scale well on distributed-memory parallel computers. The PGAS model is single program, multiple-data (SPMD), with each separate constituent process having access to local memory as it would in C++. However, PGAS also provides access to a global address space, which is allocated in shared segments that are distributed over the processes. UPC++ provides numerous methods for accessing and using global memory. In UPC++, all operations that access remote memory are explicit, which encourages programmers to be aware of the cost of communication and data movement. Moreover, all remote-memory access operations are by default asynchronous, to enable programmers to write code that scales well even on hundreds of thousands of cores.

John Bachan, Scott B. Baden, Steven Hofmeyr, Mathias Jacquelin, Amir Kamil, Dan Bonachea, Paul H. Hargrove, Hadia Ahmed, "UPC++: A High-Performance Communication Framework for Asynchronous Computation", 33rd IEEE International Parallel & Distributed Processing Symposium (IPDPS'19), Rio de Janeiro, Brazil, IEEE, May 2019, doi: 10.25344/S4V88H

UPC++ is a C++ library that supports high-performance computation via an asynchronous communication framework. This paper describes a new incarnation that differs substantially from its predecessor, and we discuss the reasons for our design decisions. We present new design features, including future-based asynchrony management, distributed objects, and generalized Remote Procedure Call (RPC).
We show microbenchmark performance results demonstrating that one-sided Remote Memory Access (RMA) in UPC++ is competitive with MPI-3 RMA; on a Cray XC40 UPC++ delivers up to a 25% improvement in the latency of blocking RMA put, and up to a 33% bandwidth improvement in an RMA throughput test. We showcase the benefits of UPC++ with irregular applications through a pair of application motifs, a distributed hash table and a sparse solver component. Our distributed hash table in UPC++ delivers near-linear weak scaling up to 34816 cores of a Cray XC40. Our UPC++ implementation of the sparse solver component shows robust strong scaling up to 2048 cores, where it outperforms variants communicating using MPI by up to 3.1x.
UPC++ encourages the use of aggressive asynchrony in low-overhead RMA and RPC, improving programmer productivity and delivering high performance in irregular applications.

John Bachan, Scott Baden, Dan Bonachea, Paul Hargrove, Steven Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen, "UPC++ Programmer's Guide, v1.0-2019.3.0", Lawrence Berkeley National Laboratory Tech Report, March 15, 2019, LBNL 2001191, doi: 10.25344/S4F301

UPC++ is a C++11 library that provides Partitioned Global Address Space (PGAS) programming. It is designed for writing parallel programs that run efficiently and scale well on distributed-memory parallel computers. The PGAS model is single program, multiple-data (SPMD), with each separate constituent process having access to local memory as it would in C++. However, PGAS also provides access to a global address space, which is allocated in shared segments that are distributed over the processes. UPC++ provides numerous methods for accessing and using global memory. In UPC++, all operations that access remote memory are explicit, which encourages programmers to be aware of the cost of communication and data movement. Moreover, all remote-memory access operations are by default asynchronous, to enable programmers to write code that scales well even on hundreds of thousands of cores.

John Bachan, Scott Baden, Dan Bonachea, Paul Hargrove, Steven Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen, "UPC++ Specification v1.0, Draft 10", Lawrence Berkeley National Laboratory Tech Report, March 15, 2019, LBNL 2001192, doi: 10.25344/S4JS30

UPC++ is a C++11 library providing classes and functions that support Partitioned Global Address Space (PGAS) programming. We are revising the library under the auspices of the DOE’s Exascale Computing Project, to meet the needs of applications requiring PGAS support. UPC++ is intended for implementing elaborate distributed data structures where communication is irregular or fine-grained. The UPC++ interfaces for moving non-contiguous data and handling memories with different optimal access methods are composable and similar to those used in conventional C++. The UPC++ programmer can expect communication to run at close to hardware speeds. The key facilities in UPC++ are global pointers, which enable the programmer to express ownership information for improving locality; one-sided communication, both put/get and RPC; and futures and continuations. Futures capture data readiness state, which is useful in making scheduling decisions, and continuations provide for completion handling via callbacks. Together, these enable the programmer to chain together a DAG of operations to execute asynchronously as high-latency dependencies become satisfied.

Scott B. Baden, Paul H. Hargrove, Hadia Ahmed, John Bachan, Dan Bonachea, Steve Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen, "Pagoda: Lightweight Communications and Global Address Space Support for Exascale Applications - UPC++", Poster at Exascale Computing Project (ECP) Annual Meeting 2019, January 2019,

Victor Yu, William Dawson, Alberto Garcia, Ville Havu, Ben Hourahine, William Huhn, Mathias Jacquelin, Weile Jia, Murat Keceli, Raul Laasner, et al., Large-Scale Benchmark of Electronic Structure Solvers with the ELSI Infrastructure, Bulletin of the American Physical Society, 2019,

Hans Johansen

2019

Tuowen Zhao, Mary Hall, Samuel Williams, Hans Johansen, "Exploiting Reuse and Vectorization in Blocked Stencil Computations on CPUs and GPUs", Supercomputing (SC), November 2019,

D.F. Martin, H.S. Johansen, P.O. Schwartz, E.G. Ng, "Improved Discretization of Grounding Lines and Calving Fronts using an Embedded-Boundary Approach in BISICLES", European Geosciences Union General Assembly, April 10, 2019,

Amir Kamil

2019

John Bachan, Scott Baden, Dan Bonachea, Paul Hargrove, Steven Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen, "UPC++ v1.0 Specification, Revision 2019.9.0", Lawrence Berkeley National Laboratory Tech Report, September 14, 2019, LBNL 2001237, doi: 10.25344/S4ZW2C

UPC++ is a C++11 library providing classes and functions that support Partitioned Global Address Space (PGAS) programming. We are revising the library under the auspices of the DOE’s Exascale Computing Project, to meet the needs of applications requiring PGAS support. UPC++ is intended for implementing elaborate distributed data structures where communication is irregular or fine-grained. The UPC++ interfaces for moving non-contiguous data and handling memories with different optimal access methods are composable and similar to those used in conventional C++. The UPC++ programmer can expect communication to run at close to hardware speeds. The key facilities in UPC++ are global pointers, which enable the programmer to express ownership information for improving locality; one-sided communication, both put/get and RPC; and futures and continuations. Futures capture data readiness state, which is useful in making scheduling decisions, and continuations provide for completion handling via callbacks. Together, these enable the programmer to chain together a DAG of operations to execute asynchronously as high-latency dependencies become satisfied.

John Bachan, Scott Baden, Dan Bonachea, Paul Hargrove, Steven Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen, "UPC++ v1.0 Programmer’s Guide, Revision 2019.9.0", Lawrence Berkeley National Laboratory Tech Report, September 14, 2019, LBNL 2001236, doi: 10.25344/S4V30R

UPC++ is a C++11 library that provides Partitioned Global Address Space (PGAS) programming. It is designed for writing parallel programs that run efficiently and scale well on distributed-memory parallel computers. The PGAS model is single program, multiple-data (SPMD), with each separate constituent process having access to local memory as it would in C++. However, PGAS also provides access to a global address space, which is allocated in shared segments that are distributed over the processes. UPC++ provides numerous methods for accessing and using global memory. In UPC++, all operations that access remote memory are explicit, which encourages programmers to be aware of the cost of communication and data movement. Moreover, all remote-memory access operations are by default asynchronous, to enable programmers to write code that scales well even on hundreds of thousands of cores.

John Bachan, Scott B. Baden, Steven Hofmeyr, Mathias Jacquelin, Amir Kamil, Dan Bonachea, Paul H. Hargrove, Hadia Ahmed, "UPC++: A High-Performance Communication Framework for Asynchronous Computation", 33rd IEEE International Parallel & Distributed Processing Symposium (IPDPS'19), Rio de Janeiro, Brazil, IEEE, May 2019, doi: 10.25344/S4V88H

UPC++ is a C++ library that supports high-performance computation via an asynchronous communication framework. This paper describes a new incarnation that differs substantially from its predecessor, and we discuss the reasons for our design decisions. We present new design features, including future-based asynchrony management, distributed objects, and generalized Remote Procedure Call (RPC).
We show microbenchmark performance results demonstrating that one-sided Remote Memory Access (RMA) in UPC++ is competitive with MPI-3 RMA; on a Cray XC40 UPC++ delivers up to a 25% improvement in the latency of blocking RMA put, and up to a 33% bandwidth improvement in an RMA throughput test. We showcase the benefits of UPC++ with irregular applications through a pair of application motifs, a distributed hash table and a sparse solver component. Our distributed hash table in UPC++ delivers near-linear weak scaling up to 34816 cores of a Cray XC40. Our UPC++ implementation of the sparse solver component shows robust strong scaling up to 2048 cores, where it outperforms variants communicating using MPI by up to 3.1x.
UPC++ encourages the use of aggressive asynchrony in low-overhead RMA and RPC, improving programmer productivity and delivering high performance in irregular applications.

John Bachan, Scott Baden, Dan Bonachea, Paul Hargrove, Steven Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen, "UPC++ Programmer's Guide, v1.0-2019.3.0", Lawrence Berkeley National Laboratory Tech Report, March 15, 2019, LBNL 2001191, doi: 10.25344/S4F301

UPC++ is a C++11 library that provides Partitioned Global Address Space (PGAS) programming. It is designed for writing parallel programs that run efficiently and scale well on distributed-memory parallel computers. The PGAS model is single program, multiple-data (SPMD), with each separate constituent process having access to local memory as it would in C++. However, PGAS also provides access to a global address space, which is allocated in shared segments that are distributed over the processes. UPC++ provides numerous methods for accessing and using global memory. In UPC++, all operations that access remote memory are explicit, which encourages programmers to be aware of the cost of communication and data movement. Moreover, all remote-memory access operations are by default asynchronous, to enable programmers to write code that scales well even on hundreds of thousands of cores.

John Bachan, Scott Baden, Dan Bonachea, Paul Hargrove, Steven Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen, "UPC++ Specification v1.0, Draft 10", Lawrence Berkeley National Laboratory Tech Report, March 15, 2019, LBNL 2001192, doi: 10.25344/S4JS30

UPC++ is a C++11 library providing classes and functions that support Partitioned Global Address Space (PGAS) programming. We are revising the library under the auspices of the DOE’s Exascale Computing Project, to meet the needs of applications requiring PGAS support. UPC++ is intended for implementing elaborate distributed data structures where communication is irregular or fine-grained. The UPC++ interfaces for moving non-contiguous data and handling memories with different optimal access methods are composable and similar to those used in conventional C++. The UPC++ programmer can expect communication to run at close to hardware speeds. The key facilities in UPC++ are global pointers, which enable the programmer to express ownership information for improving locality; one-sided communication, both put/get and RPC; and futures and continuations. Futures capture data readiness state, which is useful in making scheduling decisions, and continuations provide for completion handling via callbacks. Together, these enable the programmer to chain together a DAG of operations to execute asynchronously as high-latency dependencies become satisfied.

Scott B. Baden, Paul H. Hargrove, Hadia Ahmed, John Bachan, Dan Bonachea, Steve Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen, "Pagoda: Lightweight Communications and Global Address Space Support for Exascale Applications - UPC++", Poster at Exascale Computing Project (ECP) Annual Meeting 2019, January 2019,

Mariam Kiran

2019

Mariam Kiran, Anshuman Chhabra, "Understanding flows in high-speed scientific networks: A Netflow data study", Future Generation Computer Systems, 2019,

Daniel Ladiges

2019

D. R. Ladiges, A. J. Nonaka, J. B. Bell, A. L. Garcia, "On the Suppression and Distortion of Non-Equilibrium Fluctuations by Transpiration", August 10, 2019, doi: 10.1063/1.5093922

Xiaoye Li

2019

Y. Liu, W. Sid-Lakhdar, E. Rebrova, P. Ghysels, X. Sherry Li, "A Hierarchical Low-Rank Decomposition Algorithm Based on Blocked Adaptive Cross Approximation Algorithms", arXiv e-prints, January 1, 2019,

Lin Lin

2019

Victor Yu, William Dawson, Alberto Garcia, Ville Havu, Ben Hourahine, William Huhn, Mathias Jacquelin, Weile Jia, Murat Keceli, Raul Laasner, et al., Large-Scale Benchmark of Electronic Structure Solvers with the ELSI Infrastructure, Bulletin of the American Physical Society, 2019,

Yang Liu

2019

Y. Liu, W. Sid-Lakhdar, E. Rebrova, P. Ghysels, X. Sherry Li, "A Hierarchical Low-Rank Decomposition Algorithm Based on Blocked Adaptive Cross Approximation Algorithms", arXiv e-prints, January 1, 2019,

Boris Lo

2019

Boris Lo, Phillip Colella, "An Adaptive Local Discrete Convolution Method for the Numerical Solution of Maxwell's Equations", Communications in Applied Mathematics and Computational Science, April 26, 2019, 14:105-119, doi: 10.2140/camcos.2019.14.105

Zarija Lukic

2019

Timur Takhtaganov, Zarija Lukić, Juliane Mueller, Dmitriy Morozov, "Cosmic Inference: Constraining Parameters With Observations and Highly Limited Number of Simulations", Astrophysical Journal (in review), 2019,

J. Onorbe, F. B. Davies, Z. Lukić, J. F. Hennawi, D. Sorini, "Inhomogeneous Reionization Models in Cosmological Hydrodynamical Simulations", Monthly Notices of the Royal Astronomical Society, 2019, 486:4075, doi: 10.1093/mnras/stz984

Vikram Khaire, Michael Walther, Joseph F. Hennawi, Jose Oñorbe, Zarija Lukić, Xavier J. Prochaska, Todd M. Tripp, Joseph N. Burchett, Christian Rodriguez, "The power spectrum of the Lyman-α Forest at z < 0.5", Monthly Notices of the Royal Astronomical Society, 2019, 486:769, doi: 10.1093/mnras/stz344

M. Mustafa, D. Bard, W. Bhimji, Z. Lukić, R. Al-Rfou, J. Kratochvil, "CosmoGAN: creating high-fidelity weak lensing convergence maps using Generative Adversarial Networks", Computational Astrophysics and Cosmology, 2019, 6:1, doi: 10.1186/s40668-019-0029-9

M. Walther, J. Onorbe, J. F. Hennawi, Z. Lukić, "New Constraints on IGM Thermal Evolution from the Ly-alpha Forest Power Spectrum", The Astrophysical Journal, 2019, 872:13, doi: 10.3847/1538-4357/aafad1

Stefano Marchesini

2019

Stefano Marchesini, Anne Sakdinawat, "Shaping Coherent X-rays with Binary Optics", Optics Express, January 21, 2019, 27(2):907-917,

Daniel F. Martin

2019

Mark Adams, Stephen Cornford, Daniel Martin, Peter McCorquodale, "Composite matrix construction for structured grid adaptive mesh refinement", Computer Physics Communications, November 2019, 244:35-39, doi: 10.1016/j.cpc.2019.07.006

D.F. Martin, H.S. Johansen, P.O. Schwartz, E.G. Ng, "Improved Discretization of Grounding Lines and Calving Fronts using an Embedded-Boundary Approach in BISICLES", European Geosciences Union General Assembly, April 10, 2019,

Daniel Martin, Modeling Antarctic Ice Sheet Dynamics using Adaptive Mesh Refinement, 2019 SIAM Conference on Computational Science and Engineering, February 26, 2019,

Daniel F. Martin, Stephen L. Cornford, Antony J. Payne, "Millennial‐scale Vulnerability of the Antarctic Ice Sheet to Regional Ice Shelf Collapse", Geophysical Research Letters, January 9, 2019, doi: 10.1029/2018gl081229

Abstract: 

The Antarctic Ice Sheet (AIS) remains the largest uncertainty in projections of future sea level rise. A likely climate‐driven vulnerability of the AIS is thinning of floating ice shelves resulting from surface‐melt‐driven hydrofracture or incursion of relatively warm water into subshelf ocean cavities. The resulting melting, weakening, and potential ice‐shelf collapse reduces shelf buttressing effects. Upstream ice flow accelerates, causing thinning, grounding‐line retreat, and potential ice sheet collapse. While high‐resolution projections have been performed for localized Antarctic regions, full‐continent simulations have typically been limited to low‐resolution models. Here we quantify the vulnerability of the entire present‐day AIS to regional ice‐shelf collapse on millennial timescales treating relevant ice flow dynamics at the necessary ∼1km resolution. Collapse of any of the ice shelves dynamically connected to the West Antarctic Ice Sheet (WAIS) is sufficient to trigger ice sheet collapse in marine‐grounded portions of the WAIS. Vulnerability elsewhere appears limited to localized responses.

Plain Language Summary:

The biggest uncertainty in near‐future sea level rise (SLR) comes from the Antarctic Ice Sheet. Antarctic ice flows in relatively fast‐moving ice streams. At the ocean, ice flows into enormous floating ice shelves which push back on their feeder ice streams, buttressing them and slowing their flow. Melting and loss of ice shelves due to climate changes can result in faster‐flowing, thinning, and retreating ice, leading to accelerated rates of global sea level rise. To learn where Antarctica is vulnerable to ice‐shelf loss, we divided it into 14 sectors, applied extreme melting to each sector's floating ice shelves in turn, then ran our ice flow model 1000 years into the future for each case. We found three levels of vulnerability. The greatest vulnerability came from attacking any of the three ice shelves connected to West Antarctica, where much of the ice sits on bedrock lying below sea level. Those dramatic responses contributed around 2 m of sea level rise. The second level came from four other sectors, each with a contribution between 0.5 and 1 m. The remaining sectors produced little to no contribution. We examined combinations of sectors, determining that sectors behave independently of each other for at least a century.

Peter McCorquodale

2019

Mark Adams, Stephen Cornford, Daniel Martin, Peter McCorquodale, "Composite matrix construction for structured grid adaptive mesh refinement", Computer Physics Communications, November 2019, 244:35-39, doi: 10.1016/j.cpc.2019.07.006

Charles McParland

2019

Ciaran Roberts, Anna Scaglione, Mahdi Jamei, Reinhard Gentz, Sean Peisert, Emma M. Stewart, Chuck McParland, Alex McEachern, Daniel Arnold, "Learning Behavior of Distribution System Discrete Control Devices for Cyber-Physical Security", IEEE Transactions on Smart Grid, July 31, 2019, doi: 10.1109/TSG.2019.2936016

George Michelogiannakis

2019

George Michelogiannakis, Yiwen Shen, Min Yee Teh, Xian Meng, Benjamin Aivazi, Taylor Groves, John Shalf, Madeleine Glick, Manya Ghobadi, Larry Dennison, Keren Bergman, "Bandwidth Steering in HPC Using Silicon Nanophotonics", SC19: The International Conference for High Performance Computing, Networking, Storage, and Analysis, November 2019,

George Michelogiannakis, Bandwidth Steering in HPC Using Silicon Nanophotonics, SC19: The International Conference for High Performance Computing, Networking, Storage, and Analysis, November 20, 2019,

Pooria Mohammadiyaghni, George Michelogiannakis, Paul V. Gratz, "SpecLock: Speculative Lock Forwarding", International Conference on Computer Design (ICCD), November 2019,

Anastasiia Butko, George Michelogiannakis, David Donofrio, John Shalf, "Extending classical processors to support future large scale quantum accelerators", Proceedings of the 16th ACM International Conference on Computing Frontiers, April 2019,

Anastasiia Butko, George Michelogiannakis, David Donofrio, John Shalf, "TIGER: topology-aware task assignment approach using Ising machines", Proceedings of the 16th ACM International Conference on Computing Frontiers, April 2019,

George Michelogiannakis, Jeremiah Wilke, Min Yee Teh, Madeleine Glick, John Shalf, Keren Bergman, "Challenges and opportunities in system-level evaluation of photonics", Proceedings Volume 10946, Metro and Data Center Optical Networks and Short-Reach Links II, February 2019, doi: https://doi.org/10.1117/12.2510443

George Michelogiannakis, Computation and Communication in a Post Moore’s Law Era, Post Exascale workshop part of HiPEAC conference, January 2019,

D Vasudevan, G Michelogiannakis, D Donofrio, J Shalf, "PARADISE - Post-Moore Architecture and Accelerator Design Space Exploration Using Device Level Simulation and Experiments", 2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), IEEE, January 2019, doi: 10.1109/ispass.2019.00022

S Werner, P Fotouhi, X Xiao, M Fariborz, SJB Yoo, G Michelogiannakis, D Vasudevan, "3D photonics as enabling technology for deep 3D DRAM stacking", Proceedings of the International Symposium on Memory Systems - MEMSYS 19, ACM Press, January 2019, doi: 10.1145/3357526.3357559

W Cui, G Tzimpragos, Y Tao, J Mcmahan, D Dangwal, N Tsiskaridze, G Michelogiannakis, DP Vasudevan, T Sherwood, "Language Support for Navigating Architecture Design in Closed Form", ACM Journal on Emerging Technologies in Computing Systems, January 2019, 16:1-28, doi: 10.1145/3360047

Michael Minion

2019

M. Zingale, M.P. Katz, J.B. Bell, M.L. Minion, A.J. Nonaka, W. Zhang, "Improved Coupling of Hydrodynamics and Nuclear Reactions via Spectral Deferred Corrections", August 14, 2019,

Francois P. Hamon, Martin Schreiber, Michael L. Minion, "Parallel-in-Time Multi-Level Integration of the Shallow-Water Equations on the Rotating Sphere", April 12, 2019,

Submitted to Journal of Computational Physics

Sebastian Götschel, Michael Minion, "An Efficient Parallel-in-Time Method for Optimization with Parabolic PDEs", SIAM Journal on Scientific Computing, January 21, 2019,

In submission

M. Emmett, E. Motheau, W. Zhang, M. Minion, J. B. Bell, "A Fourth-Order Adaptive Mesh Refinement Algorithm for the Multicomponent, Reacting Compressible Navier-Stokes Equations", Combustion Theory and Modeling, 2019,

Francois P. Hamon, Martin Schreiber, Michael L. Minion, "Multi-Level Spectral Deferred Corrections Scheme for the Shallow Water Equations on the Rotating Sphere", Journal of Computational Physics, January 1, 2019, 376:435-454,

Emmanuel Motheau

2019

M. Emmett, E. Motheau, W. Zhang, M. Minion, J. B. Bell, "A Fourth-Order Adaptive Mesh Refinement Algorithm for the Multicomponent, Reacting Compressible Navier-Stokes Equations", Combustion Theory and Modeling, 2019,

Juliane Mueller

2019

O Karslıoğlu, M Gehlmann, J Müller, S Nemšák, JA Sethian, A Kaduwela, H Bluhm, C Fadley, "An Efficient Algorithm for Automatic Structure Optimization in X-ray Standing-Wave Experiments", Journal of Electron Spectroscopy and Related Phenomena, January 1, 2019,

J Muller, M Day, "Surrogate Optimization of Computationally Expensive Black-box Problems with Hidden Constraints", INFORMS Journal on Computing, 2019,

W. Langhans, J. Mueller, W.D. Collins, "Optimization of the Eddy-Diffusivity/Mass-Flux shallow cumulus and boundary-layer parametrization using surrogate models", Journal of Advances in Modeling Earth Systems (JAMES), 2019,

Andrew Myers

2019

Weiqun Zhang, Ann Almgren, Vince Beckner, John Bell, Johannes Blaschke, Cy Chan, Marcus Day, Brian Friesen, Kevin Gott, Daniel Graves, Max P. Katz, Andrew Myers, Tan Nguyen, Andrew Nonaka, Michele Rosso, Samuel Williams, Michael Zingale, "AMReX: a framework for block-structured adaptive mesh refinement", Journal of Open Source Software, May 2019, doi: 10.21105/joss.01370

Esmond G. Ng

2019

D.F. Martin, H.S. Johansen, P.O. Schwartz, E.G. Ng, "Improved Discretization of Grounding Lines and Calving Fronts using an Embedded-Boundary Approach in BISICLES", European Geosciences Union General Assembly, April 10, 2019,

Tan Thanh Nhat Nguyen

2019

Weiqun Zhang, Ann Almgren, Vince Beckner, John Bell, Johannes Blaschke, Cy Chan, Marcus Day, Brian Friesen, Kevin Gott, Daniel Graves, Max P. Katz, Andrew Myers, Tan Nguyen, Andrew Nonaka, Michele Rosso, Samuel Williams, Michael Zingale, "AMReX: a framework for block-structured adaptive mesh refinement", Journal of Open Source Software, May 2019, doi: 10.21105/joss.01370

Andy Nonaka

2019

D. Fan, A. Nonaka, A.S. Almgren, A. Harpole, M. Zingale, "MAESTROeX: A Massively Parallel Low Mach Number Astrophysical Solver", August 14, 2019,

M. Zingale, M.P. Katz, J.B. Bell, M.L. Minion, A.J. Nonaka, W. Zhang, "Improved Coupling of Hydrodynamics and Nuclear Reactions via Spectral Deferred Corrections", August 14, 2019,

D. R. Ladiges, A. J. Nonaka, J. B. Bell, A. L. Garcia, "On the Suppression and Distortion of Non-Equilibrium Fluctuations by Transpiration", August 10, 2019, doi: 10.1063/1.5093922

A. Donev, A. J. Nonaka, C. Kim, A. L. Garcia, J. B. Bell, "Fluctuating hydrodynamics of electrolytes at electroneutral scales", August 10, 2019,

M. Zingale, K. Eiden, Y. Cavecchi, A. Harpole, J. B. Bell, M. Chang, I. Hawke, M. P. Katz, C.M. Malone, A. J. Nonaka, D. E. Willcox, W. Zhang, "Toward resolved simulations of burning fronts in thermonuclear X-ray bursts", Journal of Physics: Conference Series, 2019, 1225,

Weiqun Zhang, Ann Almgren, Vince Beckner, John Bell, Johannes Blaschke, Cy Chan, Marcus Day, Brian Friesen, Kevin Gott, Daniel Graves, Max P. Katz, Andrew Myers, Tan Nguyen, Andrew Nonaka, Michele Rosso, Samuel Williams, Michael Zingale, "AMReX: a framework for block-structured adaptive mesh refinement", Journal of Open Source Software, May 2019, doi: 10.21105/joss.01370

Aleksandar Donev, Alejandro L. Garcia, Jean-Philippe Péraud, Andrew J. Nonaka, John B. Bell, "Fluctuating Hydrodynamics and Debye-Hückel-Onsager Theory for Electrolytes", Current Opinion in Electrochemistry, 2019, 13:1 - 10, doi: https://doi.org/10.1016/j.coelec.2018.09.004

Peter Nugent

2019

M. M. Phillips, C. Contreras, E. Y. Hsiao, N., C. R. Burns, M. Stritzinger, C. Ashall, W. L., P. Hoeflich, S. E. Persson, A. L., N. B. Suntzeff, S. A. Uddin, J. Anais, E., L. Busta, A. Campillay, S. Castellón, C., T. Diamond, C. Gall, C. Gonzalez, S., K. Krisciunas, M. Roth, J. Serón, F., S. Torres, J. P. Anderson, C. Baltay, G., L. Galbany, A. Goobar, E. Hadjiyska, M., M. Kasliwal, C. Lidman, P. E. Nugent, S., D. Rabinowitz, S. D. Ryder, B. P. Schmidt, B. J. Shappee, E. S. Walker, "Carnegie Supernova Project-II: Extending the Near-infrared Hubble Diagram for Type Ia Supernovae to z ∼ 0.1", Publications of the ASP, 2019, 131:014001, doi: 10.1088/1538-3873/aae8bd

Leonid Oliker

2019

Khaled Ibrahim, Samuel Williams, Leonid Oliker, "Performance Analysis of GPU Programming Models using the Roofline Scaling Trajectories", International Symposium on Benchmarking, Measuring and Optimizing (Bench), BEST PAPER AWARD, November 2019,

Samuel Williams, Charlene Yang, Khaled Ibrahim, Thorsten Kurth, Nan Ding, Jack Deslippe, Leonid Oliker, "Performance Analysis using the Roofline Model", SciDAC PI Meeting, July 2019,

Marquita Ellis, Giulia Guidi, Aydın Buluç, Leonid Oliker, Katherine Yelick, "diBELLA: Distributed Long Read to Long Read Alignment", 48th International Conference on Parallel Processing (ICPP), June 25, 2019,

Gilberto Pastorello

2019

P. Linton, W. Melodia, A. Lazar, D. Agarwal, L. Bianchi, D. Ghoshal, K. Wu, G. Pastorello, L. Ramakrishnan, "Identifying Time Series Similarity in Large-Scale Earth System Datasets", International Workshop on Big Data Tools, Methods, and Use Cases for Innovative Scientific Discovery (BTSD), in conjunction with the IEEE International Conference on Big Data (Big Data), 2019,

P. Linton, W. Melodia, A. Lazar, D. Agarwal, L. Bianchi, D. Ghoshal, K. Wu, G. Pastorello, L. Ramakrishnan, "Identifying Time Series Similarity in Large-Scale Earth System Datasets", The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC19), 2019,

Sean Peisert

2020

Ross Gegan, Christina Mao, Dipak Ghosal, Matt Bishop, Sean Peisert, "Anomaly Detection for Science DMZ Using System Performance Data", Proceedings of the 2020 IEEE International Conference on Computing, Networking and Communications (ICNC 2020), Big Island, HI, February 2020,

2019

Amir Teshome Wonjiga, Louis Rilling, Christine Morin, Sean Peisert, "Blockchain as a Trusted Component in Cloud SLA Verification", Proceedings of the International Workshop on Cloud, IoT and Fog Security (CIFS), co-located with the 12th IEEE/ACM International Conference on Utility and Cloud Computing (UCC), Auckland, New Zealand, December 2019,

Thomas W. Edgar, Aditya Ashok, Garret E. Seppala, K.M. Arthur-Durrett, M. Engels, Reinhard Gentz, Sean Peisert, "An Automated Disruption-Tolerant Key Management Framework for Critical Systems", Journal of Information Warfare, October 8, 2019,

Mahdi Jamei, Raksha Ramakrishna, Teklemariam Tesfay, Reinhard Gentz, Ciaran Roberts, Anna Scaglione, Sean Peisert, "Phasor Measurement Units Optimal Placement and Performance Limits for Fault Localization", IEEE Journal on Selected Areas in Communications (J-SAC), Special Issue on Communications and Data Analytics in Smart Grid, October 2, 2019, doi: 10.1109/jsac.2019.2951971

Reinhard Gentz, Sean Peisert, Joshua Boverhof, Daniel Gunter, "SPARCS: Stream-Processing Architecture applied in Real-time Cyber-physical Security", Proceedings of the 15th IEEE International Conference on e-Science (eScience), San Diego, CA, IEEE, September 2019,

Reinhard Gentz, Héctor García Martin, Edward Baidoo, Sean Peisert, "Workflow Automation in Liquid Chromatography Mass Spectrometry", Proceedings of the 15th IEEE International Conference on e-Science (eScience), San Diego, CA, IEEE, September 2019,

Ciaran Roberts, Anna Scaglione, Mahdi Jamei, Reinhard Gentz, Sean Peisert, Emma M. Stewart, Chuck McParland, Alex McEachern, Daniel Arnold, "Learning Behavior of Distribution System Discrete Control Devices for Cyber-Physical Security", IEEE Transactions on Smart Grid, July 31, 2019, doi: 10.1109/TSG.2019.2936016

Andrew Adams, Kay Avila, Jim Basney, Dana Brunson, Robert Cowles, Jeannette Dopheide, Terry Fleury, Elisa Heymann, Florence Hudson, Craig Jackson, Ryan Kiser, Mark Krenz, Jim Marsteller, Barton P. Miller, Sean Peisert, Scott Russell, Susan Sons, Von Welch, John Zage, "Trusted CI Experiences in Cybersecurity and Service to Open Science", Proceedings of the Conference on Practice and Experience in Advanced Research Computing (PEARC), ACM, July 2019,

Anna Giannakou, Dipankar Dwivedi, Sean Peisert, "A Machine Learning Approach for Packet Loss Prediction in ScienceFlows", Future Generation Computer Systems, July 2019, doi: 10.1016/j.future.2019.07.053

Melissa Stockman, Dipankar Dwivedi, Reinhard Gentz, Sean Peisert, "Detecting Programmable Logic Controller Code Using Machine Learning", International Journal of Critical Infrastructure Protection, July 2019, doi: 10.1016/j.ijcip.2019.100306

Sean Peisert, Brooks Evans, Michael Liang, Barclay Osborn, David Rusting, David Thurston, Security Without Moats and Walls: Zero-Trust Networking for Enhancing Security in R&E Environments, CENIC Annual Conference, March 19, 2019,

Sean Peisert, Experiences in Building a Mission-Driven Security R&D Program for Science and Energy, Computer Science Colloquium Seminar, University of California, Davis, February 7, 2019,

Sean Peisert, Daniel Arnold, Using Physics to Improve Cybersecurity for the Distribution Grid and Distributed Energy Resources, Naval Postgraduate School, February 5, 2019,

Sean Peisert, Building a Mission-Driven, Applied Cybersecurity R&D Program from Scratch, VISA Research, January 23, 2019,

Doru Thom Popovici

2019

Doru Thom Popovici, Devangi N. Parikh, Daniele G. Spampinato, Tze Meng Low, "Exploiting Symmetries of Small Prime-Sized DFTs", PPAM 2019, 2019,

Elliott Binder, Tze Meng Low, Doru Thom Popovici, "Portable GPU Framework for SNP Comparisons", HiCOMB 2019, 2019,

Doru Thom Popovici, Martin D. Schatz, Franz Franchetti, Tze Meng Low, "A Flexible Framework for Parallel Multi-Dimensional DFTs", April 23, 2019,

Prabhat

2019

Babak Behzad, Suren Byna, Prabhat, and Marc Snir, "Optimizing I/O Performance of HPC Applications with Autotuning", ACM Transactions on Parallel Computing (TOPC), February 28, 2019,

Lavanya Ramakrishnan

2019

P. Linton, W. Melodia, A. Lazar, D. Agarwal, L. Bianchi, D. Ghoshal, K. Wu, G. Pastorello, L. Ramakrishnan, "Identifying Time Series Similarity in Large-Scale Earth System Datasets", International Workshop on Big Data Tools, Methods, and Use Cases for Innovative Scientific Discovery (BTSD), in conjunction with the IEEE International Conference on Big Data (Big Data), 2019,

P. Linton, W. Melodia, A. Lazar, D. Agarwal, L. Bianchi, D. Ghoshal, K. Wu, G. Pastorello, L. Ramakrishnan, "Identifying Time Series Similarity in Large-Scale Earth System Datasets", The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC19), 2019,

Hannah Ross

2019

Hannah E. Ross, Keri L. Dixon, Raghunath Ghara, Ilian T. Iliev, Garrelt Mellema, "Evaluating the QSO contribution to the 21-cm signal from the Cosmic Dawn", Monthly Notices of the Royal Astronomical Society, July 2019, 487:1101-1119, doi: 10.1093/mnras/stz1220

Catherine A Watkinson, Sambit K. Giri, Hannah E. Ross, Keri L. Dixon, Ilian T. Iliev, Garrelt Mellema, Jonathan R. Pritchard, "The 21-cm bispectrum as a probe of non-Gaussianities due to X-ray heating", Monthly Notices of the Royal Astronomical Society, January 2019, 482:2653-2669, doi: 10.1093/mnras/sty2740

Oliver Rübel

2019

Donghe Kang, Oliver Rübel, Suren Byna, Spyros Blanas, "Comparison of Array Management Library Performance - A Neuroscience Use Case", SC19 Poster, November 20, 2019,

Oliver Rübel, Andrew Tritt, Benjamin Dichter, Thomas Braun, Nicholas Cain, Nathan Clack, Thomas J. Davidson, Max Dougherty, Jean-Christophe Fillion-Robin, Nile Graddis, Michael Grauer, Justin T. Kiggins, Lawrence Niu, Doruk Ozturk, William Schroeder, Ivan Soltesz, Friedrich T. Sommer, Karel Svoboda, Lydia Ng, Loren M. Frank, Kristofer Bouchard, "NWB:N 2.0: An Accessible Data Standard for Neurophysiology", bioRxiv, January 17, 2019, doi: https://doi.org/10.1101/523035

Anna Scaglione

2019

Mahdi Jamei, Raksha Ramakrishna, Teklemariam Tesfay, Reinhard Gentz, Ciaran Roberts, Anna Scaglione, Sean Peisert, "Phasor Measurement Units Optimal Placement and Performance Limits for Fault Localization", IEEE Journal on Selected Areas in Communications (J-SAC), Special Issue on Communications and Data Analytics in Smart Grid, October 2, 2019, doi: 10.1109/jsac.2019.2951971

Peter O. Schwartz

2019

D.F. Martin, H.S. Johansen, P.O. Schwartz, E.G. Ng, "Improved Discretization of Grounding Lines and Calving Fronts using an Embedded-Boundary Approach in BISICLES", European Geosciences Union General Assembly, April 10, 2019,

Oguz Selvitopi

2019

R. Oguz Selvitopi, Gunduz Vehbi Demirci, Ata Turk, Cevdet Aykanat, "Locality-aware and load-balanced static task scheduling for MapReduce", Future Generation Computer Systems (FGCS), January 2019, 90:49-61, doi: https://doi.org/10.1016/j.future.2018.06.035

John M. Shalf

2019

George Michelogiannakis, Yiwen Shen, Min Yee Teh, Xian Meng, Benjamin Aivazi, Taylor Groves, John Shalf, Madeleine Glick, Manya Ghobadi, Larry Dennison, Keren Bergman, "Bandwidth Steering in HPC Using Silicon Nanophotonics", SC19: The International Conference for High Performance Computing, Networking, Storage, and Analysis, November 2019,

Anastasiia Butko, George Michelogiannakis, David Donofrio, John Shalf, "Extending classical processors to support future large scale quantum accelerators", Proceedings of the 16th ACM International Conference on Computing Frontiers, April 2019,

Anastasiia Butko, George Michelogiannakis, David Donofrio, John Shalf, "TIGER: topology-aware task assignment approach using Ising machines", Proceedings of the 16th ACM International Conference on Computing Frontiers, April 2019,

George Michelogiannakis, Jeremiah Wilke, Min Yee Teh, Madeleine Glick, John Shalf, Keren Bergman, "Challenges and opportunities in system-level evaluation of photonics", Proceedings Volume 10946, Metro and Data Center Optical Networks and Short-Reach Links II, February 2019, doi: https://doi.org/10.1117/12.2510443

D Vasudevan, G Michelogiannakis, D Donofrio, J Shalf, "PARADISE - Post-Moore Architecture and Accelerator Design Space Exploration Using Device Level Simulation and Experiments", 2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), IEEE, January 2019, doi: 10.1109/ispass.2019.00022

Arie Shoshani

2019

Beytullah Yildiz, Kesheng Wu, Suren Byna, Arie Shoshani, "Parallel membership queries on very large scientific data sets using bitmap indexes", Concurrency and Computation: Practice and Experience, January 28, 2019, 31, doi: https://doi.org/10.1002/cpe.5157

Many scientific applications produce very large amounts of data as advances in hardware fuel computing and experimental facilities. Managing and analyzing massive quantities of scientific data is challenging as data are often stored in specific formatted files, such as HDF5 and NetCDF, which do not offer appropriate search capabilities. In this research, we investigated a special class of search capability, called membership query, to identify whether queried elements of a set are members of an attribute. Attributes that naturally have classification values appear frequently in scientific domains such as category and object type as well as in daily life such as zip code and occupation. Because classification attribute values are discrete and require random data access, performing a membership query on a large scientific data set creates challenges. We applied bitmap indexing and parallelization to membership queries to overcome these challenges. Bitmap indexing provides high performance not only for low cardinality attributes but also for high cardinality attributes, such as floating‐point variables, electric charge, or momentum in a particle physics data set, due to compression algorithms such as Word‐Aligned Hybrid. We conducted experiments, in a highly parallelized environment, on data obtained from a particle accelerator model and a synthetic data set.
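
As a rough, sequential illustration of the membership-query idea (plain C++ with invented data; it omits the Word-Aligned Hybrid compression and the parallelization studied in the paper), one bitmap is kept per distinct value of the classification attribute, and a query such as category IN {e, tau} reduces to OR-ing the selected bitmaps:

    // Rough illustration of a bitmap-index membership query in plain C++;
    // the attribute name and data are made up, and this omits the WAH
    // compression and parallelization used in the paper.
    #include <cstdint>
    #include <iostream>
    #include <string>
    #include <unordered_map>
    #include <vector>

    typedef std::vector<uint64_t> Bitmap;   // uncompressed bit vector

    static void set_bit(Bitmap &b, size_t i) { b[i / 64] |= (uint64_t(1) << (i % 64)); }
    static bool get_bit(const Bitmap &b, size_t i) { return (b[i / 64] >> (i % 64)) & 1; }

    int main() {
      // Classification attribute with a small number of distinct values.
      std::vector<std::string> category = {"e", "mu", "e", "tau", "mu", "e"};
      size_t words = (category.size() + 63) / 64;

      // Index construction: one bitmap per distinct attribute value.
      std::unordered_map<std::string, Bitmap> index;
      for (size_t i = 0; i < category.size(); ++i) {
        Bitmap &bm = index[category[i]];
        if (bm.empty()) bm.assign(words, 0);
        set_bit(bm, i);
      }

      // Membership query: category IN {"e", "tau"} is the OR of two bitmaps.
      std::vector<std::string> wanted = {"e", "tau"};
      Bitmap hits(words, 0);
      for (size_t k = 0; k < wanted.size(); ++k) {
        std::unordered_map<std::string, Bitmap>::const_iterator it = index.find(wanted[k]);
        if (it == index.end()) continue;
        for (size_t w = 0; w < words; ++w) hits[w] |= it->second[w];
      }

      for (size_t i = 0; i < category.size(); ++i)
        if (get_bit(hits, i)) std::cout << "row " << i << " matches\n";
      return 0;
    }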

Wissam M. Sid-Lakhdar

2019

Y. Liu, W. Sid-Lakhdar, E. Rebrova, P. Ghysels, X. Sherry Li, "A Hierarchical Low-Rank Decomposition Algorithm Based on Blocked Adaptive Cross Approximation Algorithms", arXiv e-prints, January 1, 2019,

Alex Sim

2020

L. Jin, A. Lazar, J. Sears, A. Todd, A. Sim, K. Wu, C. A. Spurlock, "Life Course as a Contextual System to Investigate the Effects of Life Events, Gender, and Generation on Travel Mode Use", Transportation Research Board (TRB) 99th Annual Meeting, 2020,

2019

D. Ghosal, S. Shukla, A. Sim, A. V. Thakur, K. Wu, "A Reinforcement Learning Based Network Scheduler For Deadline-Driven Data Transfers", IEEE Global Communications Conference (GLOBECOM 2019), 2019,

Q. Kang, A. Agrawal, A. Choudhary, A. Sim, K. Wu, R. Kettimuthu, P. Beckman, Z. Liu, W-K Liao, "Spatiotemporal Real-Time Anomaly Detection for Supercomputing Systems", Workshop on Big Data Predictive Maintenance using Artificial Intelligence, in conjunction with IEEE International Conference on Big Data (Big Data), 2019,

A. Lazar, A. Ballow, L. Jin, C. A. Spurlock, A. Sim, K. Wu, "Machine Learning for Prediction of Mid to Long Term Habitual Transportation Mode Use", International Workshop on Big Data Tools, Methods, and Use Cases for Innovative Scientific Discovery (BTSD), in conjunction with the IEEE International Conference on Big Data (Big Data), 2019,

B. Cetin, A. Lazar, J. Kim, A. Sim, K. Wu, "Federated Wireless Network Intrusion Detection", IEEE International Conference on Big Data (Big Data), 2019,

L. Jin, A. Lazar, J. Sears, A. Todd, A. Sim, K. Wu, C. A. Spurlock, Life course as a contextual system to investigate the effects of life events, gender and generation on travel mode usage, The Behavior, Energy & Climate Change Conference (BECC), 2019,

J. Balcas, H. Newman, M. Spiropulu, X. Yang, T. Lehman, I. Monga, C. Guok, J. MacAuley, A. Sim, P. Demar, "SDN for End-to-End Networking at Exascale", the 24th International Conference on Computing in High Energy and Nuclear Physics (CHEP2019), 2019,

Alexandra Ballow, Alina Lazar (Advisor), Alex Sim (Advisor), Kesheng Wu (Advisor), "Handling Missing Values in Joint Sequence Analysis", ACM Richard Tapia Celebration of Diversity in Computing (TAPIA 2019), ACM Student Research Competition (SRC), First place winner, 2019,

J. Choi, A. Sim, Data reduction methods, systems and devices, U.S. Patent No. 10,366,078, 2019,

U.S. Patent No. 10,366,078, “DATA REDUCTION METHODS, SYSTEMS, AND DEVICES”, LBNL IB2013-133.

H. Sung, J. Bang, A. Sim, K. Wu, H. Eom, "Understanding Parallel I/O Performance Trends Under Various HPC Configurations", the 2nd International Workshop on Systems and Network Telemetry and Analytics (SNTA 2019), in conjunction with ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC 2019), 2019, doi: 10.1145/3322798.3329258

M. Jin, Y. Homma, A. Sim, W. Kroeger, K. Wu, "Performance Prediction for Data Transfers in LCLS Workflow", the 2nd International Workshop on Systems and Network Telemetry and Analytics (SNTA 2019), in conjunction with ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC 2019), 2019, doi: 10.1145/3322798.3329254

O. Del Guercio, R. Orozco, A. Sim, K. Wu, "Similarity-based Compression with Multidimensional Pattern Matching", the 2nd International Workshop on Systems and Network Telemetry and Analytics (SNTA 2019), in conjunction with ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC 2019), 2019, doi: 10.1145/3322798.3329252

A. Syal, A. Lazar, J. Kim, K. Wu, A. Sim, "Automatic Detection of Network Traffic Anomalies and Changes", the 2nd International Workshop on Systems and Network Telemetry and Analytics (SNTA 2019), in conjunction with ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC 2019), 2019, doi: 10.1145/3322798.3329255

S. Kim, A. Sim, K. Wu, S. Byna, T. Wang, Y. Son, H. Eom, "DCA-IO: A Dynamic I/O Control Scheme for Parallel and Distributed File System", 19th Annual IEEE/ACM International Symposium in Cluster, Cloud, and Grid Computing (CCGrid 2019), 2019, doi: 10.1109/CCGRID.2019.00049

U.S. Patent Application No. 20190138371, "Methods, systems, and devices for accurate signal timing of power component events"

J. Kim, A. Sim, B. Tierney, S. Suh, I. Kim, "Multivariate Network Traffic Analysis using Clustered Patterns", Journal of Computing, April 2019, 101(4):339-361, doi: 10.1007/s00607-018-0619-4

J. Kim, A. Sim, "A new approach to multivariate network traffic analysis", Journal of Computer Science and Technology, 2019, 34(2):388–402, doi: 10.1007/s11390-019-1915-y

Olivia Del Guercio, Rafael Orozco, Alex Sim, Kesheng Wu, "Multidimensional Compression with Pattern Matching", Data Compression Conference (DCC), 2019, doi: 10.1109/DCC.2019.00079

Alexandra Ballow, Alina Lazar, Alex Sim, Kesheng Wu, "Joint Sequence Analysis Challenges: How to Handle Missing Values and Mixed Variable Types", SIAM Conference on Computational Science and Engineering (CSE19), 2019,

Tyler Leibengood, Alina Lazar, Alex Sim, Kesheng Wu, "Network Traffic Performance Prediction with Multivariate Clusters in Time Windows", SIAM Conference on Computational Science and Engineering (CSE19), 2019,

Alina Lazar, Ling Jin, C Anna Spurlock, Kesheng Wu, Alex Sim, Annika Todd, "Evaluating the Effects of Missing Values and Mixed Data Types on Social Sequence Clustering Using t-SNE Visualization", Journal of Data and Information Quality (JDIQ), 2019, 11:7,

Sambit Shukla, Dipak Ghosal, Kesheng Wu, Alex Sim, Matthew Farrens, "Co-optimizing Latency and Energy for IoT services using HMP servers in Fog Clusters", 2019 Fourth International Conference on Fog and Mobile Edge Computing (FMEC), 2019, 121--128, doi: 10.1109/FMEC.2019.8795353

Horst D. Simon

2019

Jung Heon Song, Marcos López de Prado, Horst D. Simon, Kesheng Wu, "Extracting Signals from High-Frequency Trading with Digital Signal Processing Tools", The Journal of Financial Data Science, pages 124-138, 2019, doi: 10.3905/jfds.2019.1.4.124

Houjun Tang

2019

Richard Warren, Jerome Soumagne, Jingqing Mu, Houjun Tang, Suren Byna, Bin Dong, Quincey Koziol, "Analysis in the Data Path of an Object-centric Data Management System", 26th IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC 2019), December 18, 2019,

Houjun Tang, Suren Byna, Stephen Bailey, Zarija Lukic, Jialin Liu, Quincey Koziol, Bin Dong, "Tuning Object-centric Data Management Systems for Large Scale Scientific Applications", 26th IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC 2019), December 18, 2019,

Wei Zhang, Suren Byna, Houjun Tang, Brody Williams, Yong Chen, "MIQS: Metadata Indexing and Querying Service for Self-Describing File Formats", The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC19), November 19, 2019,

Houjun Tang, Quincey Koziol, Suren Byna, John Mainzer, Tonglin Li, "Enabling Transparent Asynchronous I/O using Background Threads", 2019 IEEE/ACM Fourth International Parallel Data Systems Workshop (PDSW 2019), November 19, 2019, doi: 10.1109/PDSW49588.2019.00006

Bin Dong, Kesheng Wu, Suren Byna, Houjun Tang, "SLOPE: Structural Locality-aware Programming Model for Composing Array Data Analysis", ISC 2019 (acceptance rate: 24%), June 16, 2019,

Tonglin Li, Quincey Koziol, Houjun Tang, Jialin Liu, Suren Byna, "I/O Performance Analysis of Science Applications Using HDF5 File-level Provenance", Cray User Group (CUG) 2019, May 10, 2019,

Jingqing Mu, Jerome Soumagne, Suren Byna, Quincey Koziol, Houjun Tang, Richard Warren, "Interfacing HDF5 with A Scalable Object-centric Storage System on Hierarchical Storage", Cray User Group (CUG) 2019, May 7, 2019,

Yu-Hang Tang

2019

Yu-Hang Tang, Wibe A. de Jong, "Prediction of atomization energy using graph kernel and active learning", The Journal of Chemical Physics, January 25, 2019, 150:044107, doi: 10.1063/1.5078640

David Trebotich

2019

Sergi Molins, David Trebotich, Bhavna Arora, Carl Steefel, Hang Deng, "Multi-scale Model of Reactive Transport in Fractured Media: Diffusion Limitations on Rates", Transport in Porous Media, March 20, 2019, 128:701-721, doi: 10.1007/s11242-019-01266-2

Andrew Tritt

2019

Oliver Rübel, Andrew Tritt, Benjamin Dichter, Thomas Braun, Nicholas Cain, Nathan Clack, Thomas J. Davidson, Max Dougherty, Jean-Christophe Fillion-Robin, Nile Graddis, Michael Grauer, Justin T. Kiggins, Lawrence Niu, Doruk Ozturk, William Schroeder, Ivan Soltesz, Friedrich T. Sommer, Karel Svoboda, Lydia Ng, Loren M. Frank, Kristofer Bouchard, "NWB:N 2.0: An Accessible Data Standard for Neurophysiology", bioRxiv, January 17, 2019, doi: https://doi.org/10.1101/523035

Brian Van Straalen

2019

John Bachan, Scott Baden, Dan Bonachea, Paul Hargrove, Steven Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen, "UPC++ v1.0 Specification, Revision 2019.9.0", Lawrence Berkeley National Laboratory Tech Report, September 14, 2019, LBNL 2001237, doi: 10.25344/S4ZW2C

UPC++ is a C++11 library providing classes and functions that support Partitioned Global Address Space (PGAS) programming. We are revising the library under the auspices of the DOE’s Exascale Computing Project, to meet the needs of applications requiring PGAS support. UPC++ is intended for implementing elaborate distributed data structures where communication is irregular or fine-grained. The UPC++ interfaces for moving non-contiguous data and handling memories with different optimal access methods are composable and similar to those used in conventional C++. The UPC++ programmer can expect communication to run at close to hardware speeds. The key facilities in UPC++ are global pointers, that enable the programmer to express ownership information for improving locality, one-sided communication, both put/get and RPC, futures and continuations. Futures capture data readiness state, which is useful in making scheduling decisions, and continuations provide for completion handling via callbacks. Together, these enable the programmer to chain together a DAG of operations to execute asynchronously as high-latency dependencies become satisfied.
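
The facilities named above compose naturally in code. The following is a minimal, hypothetical sketch (not taken from the specification) showing a global pointer allocated in the shared segment, a one-sided rput chained to a completion continuation, and a remote procedure call; it assumes a standard UPC++ installation providing the upcxx:: API.

#include <upcxx/upcxx.hpp>
#include <iostream>

int main() {
  upcxx::init();
  int me = upcxx::rank_me();
  int n  = upcxx::rank_n();

  // Allocate one integer in this rank's shared segment and share
  // rank 0's global pointer with everyone.
  upcxx::global_ptr<int> gp = upcxx::new_<int>(0);
  upcxx::global_ptr<int> root_gp = upcxx::broadcast(gp, 0).wait();

  // One-sided RMA put into rank 0's memory, with a continuation
  // that runs once the put has completed.
  upcxx::future<> f = upcxx::rput(me, root_gp)
      .then([]() { /* completion callback */ });
  f.wait();
  upcxx::barrier();

  // Generalized RPC: ask rank 0 to read back the value it now holds.
  if (me == n - 1) {
    int seen = upcxx::rpc(0, [=]() { return *root_gp.local(); }).wait();
    std::cout << "rank 0 holds " << seen << std::endl;
  }

  upcxx::barrier();
  upcxx::delete_(gp);   // release the shared-segment allocation
  upcxx::finalize();
  return 0;
}

Building and launching such a program is typically done with the upcxx compiler wrapper and upcxx-run, though the exact commands depend on the installation.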

John Bachan, Scott Baden, Dan Bonachea, Paul Hargrove, Steven Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen, "UPC++ v1.0 Programmer’s Guide, Revision 2019.9.0", Lawrence Berkeley National Laboratory Tech Report, September 14, 2019, LBNL 2001236, doi: 10.25344/S4V30R

UPC++ is a C++11 library that provides Partitioned Global Address Space (PGAS) programming. It is designed for writing parallel programs that run efficiently and scale well on distributed-memory parallel computers. The PGAS model is single program, multiple-data (SPMD), with each separate constituent process having access to local memory as it would in C++. However, PGAS also provides access to a global address space, which is allocated in shared segments that are distributed over the processes. UPC++ provides numerous methods for accessing and using global memory. In UPC++, all operations that access remote memory are explicit, which encourages programmers to be aware of the cost of communication and data movement. Moreover, all remote-memory access operations are by default asynchronous, to enable programmers to write code that scales well even on hundreds of thousands of cores.
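
As a concrete illustration of the explicit, asynchronous remote access described above, the hypothetical sketch below uses a upcxx::dist_object so that each process contributes one value to the global address space and then fetches a neighbor's copy without blocking immediately; it assumes a standard UPC++ installation and is not drawn from the guide itself.

#include <upcxx/upcxx.hpp>
#include <iostream>

int main() {
  upcxx::init();
  int me = upcxx::rank_me();
  int n  = upcxx::rank_n();

  // Each process registers its own instance of a distributed object;
  // together the per-rank instances form the global view.
  upcxx::dist_object<int> local_val(100 * me);
  upcxx::barrier();   // make sure every rank has constructed its instance

  // Remote access is explicit and asynchronous: fetch() returns a
  // future for the right-hand neighbor's copy instead of blocking.
  int right = (me + 1) % n;
  upcxx::future<int> f = local_val.fetch(right);

  // Independent local work could be overlapped here before wait().
  int neighbor_val = f.wait();
  std::cout << "rank " << me << " fetched " << neighbor_val
            << " from rank " << right << std::endl;

  upcxx::barrier();   // keep the dist_object alive until all fetches land
  upcxx::finalize();
  return 0;
}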

John Bachan, Scott Baden, Dan Bonachea, Paul Hargrove, Steven Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen, "UPC++ Programmer's Guide, v1.0-2019.3.0", Lawrence Berkeley National Laboratory Tech Report, March 15, 2019, LBNL 2001191, doi: 10.25344/S4F301

UPC++ is a C++11 library that provides Partitioned Global Address Space (PGAS) programming. It is designed for writing parallel programs that run efficiently and scale well on distributed-memory parallel computers. The PGAS model is single program, multiple-data (SPMD), with each separate constituent process having access to local memory as it would in C++. However, PGAS also provides access to a global address space, which is allocated in shared segments that are distributed over the processes. UPC++ provides numerous methods for accessing and using global memory. In UPC++, all operations that access remote memory are explicit, which encourages programmers to be aware of the cost of communication and data movement. Moreover, all remote-memory access operations are by default asynchronous, to enable programmers to write code that scales well even on hundreds of thousands of cores.

John Bachan, Scott Baden, Dan Bonachea, Paul Hargrove, Steven Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen, "UPC++ Specification v1.0, Draft 10", Lawrence Berkeley National Laboratory Tech Report, March 15, 2019, LBNL 2001192, doi: 10.25344/S4JS30

UPC++ is a C++11 library providing classes and functions that support Partitioned Global Address Space (PGAS) programming. We are revising the library under the auspices of the DOE’s Exascale Computing Project, to meet the needs of applications requiring PGAS support. UPC++ is intended for implementing elaborate distributed data structures where communication is irregular or fine-grained. The UPC++ interfaces for moving non-contiguous data and handling memories with different optimal access methods are composable and similar to those used in conventional C++. The UPC++ programmer can expect communication to run at close to hardware speeds. The key facilities in UPC++ are global pointers, that enable the programmer to express ownership information for improving locality, one-sided communication, both put/get and RPC, futures and continuations. Futures capture data readiness state, which is useful in making scheduling decisions, and continuations provide for completion handling via callbacks. Together, these enable the programmer to chain together a DAG of operations to execute asynchronously as high-latency dependencies become satisfied.

Scott B. Baden, Paul H. Hargrove, Hadia Ahmed, John Bachan, Dan Bonachea, Steve Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen, "Pagoda: Lightweight Communications and Global Address Space Support for Exascale Applications - UPC++", Poster at Exascale Computing Project (ECP) Annual Meeting 2019, January 2019,

Dilip Vasudevan

2019

D Vasudevan, G Tzimpragos, T Sherwood, A Madhavan, D Strukov, "Boosted Race Trees for Low Energy Classification", ("Best Paper Award"), ASPLOS 2019, April 2019, doi: 10.1145/3297858.3304036

D Vasudevan, G Michelogiannakis, D Donofrio, J Shalf, "PARADISE - Post-Moore Architecture and Accelerator Design Space Exploration Using Device Level Simulation and Experiments", 2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), IEEE, January 2019, doi: 10.1109/ispass.2019.00022

S Werner, P Fotouhi, X Xiao, M Fariborz, SJB Yoo, G Michelogiannakis, D Vasudevan, "3D photonics as enabling technology for deep 3D DRAM stacking", Proceedings of the International Symposium on Memory Systems - MEMSYS 19, ACM Press, January 2019, doi: 10.1145/3357526.3357559

W Cui, G Tzimpragos, Y Tao, J Mcmahan, D Dangwal, N Tsiskaridze, G Michelogiannakis, DP Vasudevan, T Sherwood, "Language Support for Navigating Architecture Design in Closed Form", ACM Journal on Emerging Technologies in Computing Systems, January 2019, 16:1--28, doi: 10.1145/3360047

Teng Wang

2019

S. Kim, A. Sim, K. Wu, S. Byna, T. Wang, Y. Son, H. Eom, "DCA-IO: A Dynamic I/O Control Scheme for Parallel and Distributed File System", 19th Annual IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing (CCGrid 2019), 2019, doi: 10.1109/CCGRID.2019.00049

Teng Wang, Suren Byna, Glenn Lockwood, Philip Carns, Shane Snyder, Sunggon Kim, Nicholas Wright, "A Zoom-in Analysis of I/O Logs to Detect Root Causes of I/O Performance Bottlenecks", IEEE/ACM CCGrid 2019, May 14, 2019,

Donald Willcox

2019

M. Zingale, K. Eiden, Y. Cavecchi, A. Harpole, J. B. Bell, M. Chang, I. Hawke, M. P. Katz, C.M. Malone, A. J. Nonaka, D. E. Willcox, W. Zhang, "Toward resolved simulations of burning fronts in thermonuclear X-ray bursts", Journal of Physics: Conference Series, 2019, 1225,

Samuel W. Williams

2019

Tuowen Zhao, Mary Hall, Samuel Williams, Hans Johansen, "Exploiting Reuse and Vectorization in Blocked Stencil Computations on CPUs and GPUs", Supercomputing (SC), November 2019,

Nan Ding, Samuel Williams, "An Instruction Roofline Model for GPUs", Performance Modeling, Benchmarking, and Simulation (PMBS), BEST PAPER AWARD, November 18, 2019,

Khaled Ibrahim, Samuel Williams, Leonid Oliker, "Performance Analysis of GPU Programming Models using the Roofline Scaling Trajectories", International Symposium on Benchmarking, Measuring and Optimizing (Bench), BEST PAPER AWARD, November 2019,

Charlene Yang, Thorsten Kurth, Samuel Williams, "Hierarchical Roofline analysis for GPUs: Accelerating performance optimization for the NERSC-9 Perlmutter system", Concurrency and Computation: Practice and Experience (CCPE), August 2019, doi: 10.1002/cpe.5547

Nan Ding, Samuel Williams, Sherry Li, Yang Liu, "Leveraging One-Sided Communication for Sparse Triangular Solvers", SciDAC19, July 18, 2019,

Samuel Williams, Charlene Yang, Khaled Ibrahim, Thorsten Kurth, Nan Ding, Jack Deslippe, Leonid Oliker, "Performance Analysis using the Roofline Model", SciDAC PI Meeting, July 2019,

Weiqun Zhang, Ann Almgren, Vince Beckner, John Bell, Johannes Blaschke, Cy Chan, Marcus Day, Brian Friesen, Kevin Gott, Daniel Graves, Max P. Katz, Andrew Myers, Tan Nguyen, Andrew Nonaka, Michele Rosso, Samuel Williams, Michael Zingale, "AMReX: a framework for block-structured adaptive mesh refinement", Journal of Open Source Software, May 2019, doi: 10.21105/joss.01370

Charlene Yang, Thorsten Kurth, Samuel Williams, "Hierarchical Roofline Analysis for GPUs: Accelerating Performance Optimization for the NERSC-9 Perlmutter System", Cray User Group (CUG), May 2019,

Wenjing Ma, Yulong Ao, Chao Yang, Samuel Williams, "Solving a trillion unknowns per second with HPGMG on Sunway TaihuLight", Cluster Computing, May 2019, doi: 10.1007/s10586-019-02938-w

Charlene Yang, Samuel Williams, Performance Analysis of GPU-Accelerated Applications using the Roofline Model, GPU Technology Conference (GTC), March 2019,

Samuel Williams, Performance Modeling and Analysis, CS267 Lecture, University of California at Berkeley, February 14, 2019,

Samuel Williams, Introduction to the Roofline Model, Roofline Tutorial, ECP Annual Meeting, January 2019,

Samuel Williams, Roofline on CPU-based Systems, Roofline Tutorial, ECP Annual Meeting, January 2019,

Nicholas J. Wright

2019

Glenn K. Lockwood, Shane Snyder, Suren Byna, Philip Carns, Nicholas J. Wright, "Understanding Data Motion in the Modern HPC Data Center", 2019 IEEE/ACM Fourth International Parallel Data Systems Workshop (PDSW), November 19, 2019, doi: 10.1109/PDSW49588.2019.00012

Teng Wang, Suren Byna, Glenn Lockwood, Philip Carns, Shane Snyder, Sunggon Kim, Nicholas Wright, "A Zoom-in Analysis of I/O Logs to Detect Root Causes of I/O Performance Bottlenecks", IEEE/ACM CCGrid 2019, May 14, 2019,

Kesheng Wu

2020

L. Jin, A. Lazar, J. Sears, A. Todd, A. Sim, K. Wu, C. A. Spurlock, "Life Course as a Contextual System to Investigate the Effects of Life Events, Gender, and Generation on Travel Mode Use", Transportation Research Board (TRB) 99th Annual Meeting, 2020,

2019

D. Ghosal, S. Shukla, A. Sim, A. V. Thakur, K. Wu, "A Reinforcement Learning Based Network Scheduler For Deadline-Driven Data Transfers", IEEE Global Communications Conference (GLOBECOM 2019), 2019,

Q. Kang, A. Agrawal, A. Choudhary, A. Sim, K. Wu, R. Kettimuthu, P. Beckman, Z. Liu, W-K Liao, "Spatiotemporal Real-Time Anomaly Detection for Supercomputing Systems", Workshop on Big Data Predictive Maintenance using Artificial Intelligence, in conjunction with IEEE International Conference on Big Data (Big Data), 2019,

A. Lazar, A. Ballow, L. Jin, C. A. Spurlock, A. Sim, K. Wu, "Machine Learning for Prediction of Mid to Long Term Habitual Transportation Mode Use", International Workshop on Big Data Tools, Methods, and Use Cases for Innovative Scientific Discovery (BTSD), in conjunction with the IEEE International Conference on Big Data (Big Data), 2019,

B. Cetin, A. Lazar, J. Kim, A. Sim, K. Wu, "Federated Wireless Network Intrusion Detection", IEEE International Conference on Big Data (Big Data), 2019,

P. Linton, W. Melodia, A. Lazar, D. Agarwal, L. Bianchi, D. Ghoshal, K. Wu, G. Pastorello, L. Ramakrishnan, "Identifying Time Series Similarity in Large-Scale Earth System Datasets", International Workshop on Big Data Tools, Methods, and Use Cases for Innovative Scientific Discovery (BTSD), in conjunction with the IEEE International Conference on Big Data (Big Data), 2019,

L. Jin, A. Lazar, J. Sears, A. Todd, A. Sim, K. Wu, C. A. Spurlock, Life course as a contextual system to investigate the effects of life events, gender and generation on travel mode usage, The Behavior, Energy & Climate Change Conference (BECC), 2019,

P. Linton, W. Melodia, A. Lazar, D. Agarwal, L. Bianchi, D. Ghoshal, K. Wu, G. Pastorello, L. Ramakrishnan, "Identifying Time Series Similarity in Large-Scale Earth System Datasets", The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC19), 2019,

Alexandra Ballow, Alina Lazar (Advisor), Alex Sim (Advisor), Kesheng Wu (Advisor), "Handling Missing Values in Joint Sequence Analysis", ACM Richard Tapia Celebration of Diversity in Computing (TAPIA 2019), ACM Student Research Competition (SRC), First place winner, 2019,

Jung Heon Song, Marcos López de Prado, Horst D. Simon, Kesheng Wu, "Extracting Signals from High-Frequency Trading with Digital Signal Processing Tools", The Journal of Financial Data Science, 2019, 124-138, doi: 10.3905/jfds.2019.1.4.124

Bin Dong, Patrick Frank Heiner Kilian, Xiaocan Li, Fan Guo, Suren Byna and Kesheng Wu, "Terabyte-scale Particle Data Analysis: An ArrayUDF Case Study", SSDBM 2019, July 23, 2019,

Jongbeen Han, Heemin Kim, Hyeonsang Eom, Jonathan Coignard, Kesheng Wu, Yongseok Son, "Enabling SQL-Query Processing for Ethereum-based Blockchain Systems", WIMS2019, New York, NY, USA, ACM, 2019, 9:1--9:7, doi: 10.1145/3326467.3326479

H. Sung, J. Bang, A. Sim, K. Wu, H. Eom, "Understanding Parallel I/O Performance Trends Under Various HPC Configurations", the 2nd International Workshop on Systems and Network Telemetry and Analytics (SNTA 2019), in conjunction with ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC 2019), 2019, doi: 10.1145/3322798.3329258

M. Jin, Y. Homma, A. Sim, W. Kroeger, K. Wu, "Performance Prediction for Data Transfers in LCLS Workflow", the 2nd International Workshop on Systems and Network Telemetry and Analytics (SNTA 2019), in conjunction with ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC 2019), 2019, doi: 10.1145/3322798.3329254

O. Del Guercio, R. Orozco, A. Sim, K. Wu, "Similarity-based Compression with Multidimensional Pattern Matching", the 2nd International Workshop on Systems and Network Telemetry and Analytics (SNTA 2019), in conjunction with ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC 2019), 2019, doi: 10.1145/3322798.3329252

A. Syal, A. Lazar, J. Kim, K. Wu, A. Sim, "Automatic Detection of Network Traffic Anomalies and Changes", the 2nd International Workshop on Systems and Network Telemetry and Analytics (SNTA 2019), in conjunction with ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC 2019), 2019, doi: 10.1145/3322798.3329255

Bin Dong, Kesheng Wu, Suren Byna, Houjun Tang, "SLOPE: Structural Locality-aware Programming Model for Composing Array Data Analysis", ISC 2019 (acceptance rate: 24%), June 16, 2019,

S. Kim, A. Sim, K. Wu, S. Byna, T. Wang, Y. Son, H. Eom, "DCA-IO: A Dynamic I/O Control Scheme for Parallel and Distributed File System", 19th Annual IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing (CCGrid 2019), 2019, doi: 10.1109/CCGRID.2019.00049

US Patent application no. 20190138371, "Methods, systems, and devices for accurate signal timing of power component events"

Olivia Del Guercio, Rafael Orozco, Alex Sim, Kesheng Wu, "Multidimensional Compression with Pattern Matching", Data Compression Conference (DCC), 2019, doi: 10.1109/DCC.2019.00079

Alexandra Ballow, Alina Lazar, Alex Sim, Kesheng Wu, "Joint Sequence Analysis Challenges: How to Handle Missing Values and Mixed Variable Types", SIAM Conference on Computational Science and Engineering (CSE19), 2019,

Tyler Leibengood, Alina Lazar, Alex Sim, Kesheng Wu, "Network Traffic Performance Prediction with Multivariate Clusters in Time Windows", SIAM Conference on Computational Science and Engineering (CSE19), 2019,

Beytullah Yildiz, Kesheng Wu, Suren Byna, Arie Shoshani, "Parallel membership queries on very large scientific data sets using bitmap indexes", Concurrency and Computation: Practice and Experience, January 28, 2019, 31, doi: https://doi.org/10.1002/cpe.5157

Many scientific applications produce very large amounts of data as advances in hardware fuel computing and experimental facilities. Managing and analyzing massive quantities of scientific data is challenging because the data are often stored in specially formatted files, such as HDF5 and NetCDF, which do not offer appropriate search capabilities. In this research, we investigated a special class of search capability, called membership query, which identifies whether queried elements of a set are members of an attribute. Attributes with naturally occurring classification values, such as category and object type, appear frequently in scientific domains, as do zip code and occupation in daily life. Because classification attribute values are discrete and require random data access, performing a membership query on a large scientific data set is challenging. We applied bitmap indexing and parallelization to membership queries to overcome these challenges. Thanks to compression algorithms such as Word-Aligned Hybrid, bitmap indexing provides high performance not only for low-cardinality attributes but also for high-cardinality attributes, such as floating-point variables, electric charge, or momentum in a particle physics data set. We conducted experiments in a highly parallelized environment on data obtained from a particle accelerator model and on a synthetic data set.
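
To make the membership-query idea concrete, the toy C++ sketch below builds one bitmap per classification value and answers a membership query by OR-ing the bitmaps of the queried values. It is only an illustration of the principle: the study itself relies on compressed (Word-Aligned Hybrid) bitmap indexes and parallel query evaluation, neither of which this uncompressed, single-threaded example attempts, and all names in the code are hypothetical.

#include <cstdint>
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

// Minimal uncompressed bitmap index over one categorical attribute.
struct BitmapIndex {
  std::size_t n_rows = 0;
  std::unordered_map<std::string, std::vector<std::uint64_t>> bitmaps;

  // Append one row: set the row's bit in the bitmap of its value.
  void append(const std::string& value) {
    std::size_t row = n_rows++;
    auto& bm = bitmaps[value];
    bm.resize((n_rows + 63) / 64, 0);
    bm[row / 64] |= (std::uint64_t{1} << (row % 64));
  }

  // Membership query: rows whose value is in the queried set are the
  // bitwise OR of the per-value bitmaps.
  std::vector<std::uint64_t> query(const std::vector<std::string>& values) const {
    std::vector<std::uint64_t> result((n_rows + 63) / 64, 0);
    for (const auto& v : values) {
      auto it = bitmaps.find(v);
      if (it == bitmaps.end()) continue;
      for (std::size_t w = 0; w < it->second.size(); ++w)
        result[w] |= it->second[w];
    }
    return result;
  }
};

int main() {
  BitmapIndex idx;
  for (std::string occ : {"physicist", "chemist", "engineer", "physicist"})
    idx.append(occ);

  // Which rows have an occupation in {physicist, engineer}?  (rows 0, 2, 3)
  auto hits = idx.query({"physicist", "engineer"});
  for (std::size_t row = 0; row < idx.n_rows; ++row)
    if ((hits[row / 64] >> (row % 64)) & 1)
      std::cout << "row " << row << " matches" << std::endl;
  return 0;
}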

Alina Lazar, Ling Jin, C Anna Spurlock, Kesheng Wu, Alex Sim, Annika Todd, "Evaluating the Effects of Missing Values and Mixed Data Types on Social Sequence Clustering Using t-SNE Visualization", Journal of Data and Information Quality (JDIQ), 2019, 11:7,

Sambit Shukla, Dipak Ghosal, Kesheng Wu, Alex Sim, Matthew Farrens, "Co-optimizing Latency and Energy for IoT services using HMP servers in Fog Clusters", 2019 Fourth International Conference on Fog and Mobile Edge Computing (FMEC), 2019, 121--128, doi: 10.1109/FMEC.2019.8795353

Chao Yang

2019

Victor Yu, William Dawson, Alberto Garcia, Ville Havu, Ben Hourahine, William Huhn, Mathias Jacquelin, Weile Jia, Murat Keceli, Raul Laasner, et al., "Large-Scale Benchmark of Electronic Structure Solvers with the ELSI Infrastructure", Bulletin of the American Physical Society, 2019,

Katherine Yelick

2019

Marquita Ellis, Giulia Guidi, Aydın Buluç, Leonid Oliker, Katherine Yelick, "diBELLA: Distributed Long Read to Long Read Alignment", 48th International Conference on Parallel Processing (ICPP), June 25, 2019,

Weiqun Zhang

2019

M. Zingale, M.P. Katz, J.B. Bell, M.L. Minion, A.J. Nonaka, W. Zhang, "Improved Coupling of Hydrodynamics and Nuclear Reactions via Spectral Deferred Corrections", August 14, 2019,

M. Zingale, K. Eiden, Y. Cavecchi, A. Harpole, J. B. Bell, M. Chang, I. Hawke, M. P. Katz, C.M. Malone, A. J. Nonaka, D. E. Willcox, W. Zhang, "Toward resolved simulations of burning fronts in thermonuclear X-ray bursts", Journal of Physics: Conference Series, 2019, 1225,

Weiqun Zhang, Ann Almgren, Vince Beckner, John Bell, Johannes Blaschke, Cy Chan, Marcus Day, Brian Friesen, Kevin Gott, Daniel Graves, Max P. Katz, Andrew Myers, Tan Nguyen, Andrew Nonaka, Michele Rosso, Samuel Williams, Michael Zingale, "AMReX: a framework for block-structured adaptive mesh refinement", Journal of Open Source Software, May 2019, doi: 10.21105/joss.01370

M. Emmett, E. Motheau, W. Zhang, M. Minion, J. B. Bell, "A Fourth-Order Adaptive Mesh Refinement Algorithm for the Multicomponent, Reacting Compressible Navier-Stokes Equations", Combustion Theory and Modeling, 2019,

Wibe Albert de Jong

2019

Yu-Hang Tang, Wibe A. de Jong, "Prediction of atomization energy using graph kernel and active learning", The Journal of Chemical Physics, January 25, 2019, 150:044107, doi: 10.1063/1.5078640

Other

2019

B. Peng, R. Van Beeumen, D. B. Williams-Young, K. Kowalski and C. Yang, "Approximate Green's Function Coupled Cluster Method Employing Effective Dimension Reduction", Journal of Chemical Theory and Computation, April 15, 2019, 15:3185-3196, doi: https://doi.org/10.1021/acs.jctc.9b00172

Jack Deslippe, Optimization Use Cases with the Roofline Model, Roofline Tutorial, ECP Annual Meeting, January 2019,

Charlene Yang, Performance Analysis with Roofline on GPUs, Roofline Tutorial, ECP Annual Meeting, January 2019,

E. Y. Hsiao, M. M. Phillips, G. H. Marion, R. P., N. Morrell, D. J. Sand, C. R. Burns, C., P. Hoeflich, M. D. Stritzinger, S., J. P. Anderson, C. Ashall, C. Baltay, E., D. P. K. Banerjee, S. Davis, T. R. Diamond, G., W. L. Freedman, F. Foerster, L., C. Gall, S. Gonzalez-Gaitan, A., M. Hamuy, S. Holmbo, M. M. Kasliwal, K., S. Kumar, C. Lidman, J. Lu, P. E., S. Perlmutter, S. E. Persson, A. L., D. Rabinowitz, M. Roth, S. D. Ryder, B. P., M. Shahbandeh, N. B. Suntzeff, F. Taddia, S. Uddin, L. Wang, "Carnegie Supernova Project-II: The Near-infrared Spectroscopy Program", Publications of the ASP, Pages: 014002, 2019, doi: 10.1088/1538-3873/aae961

Junmin Gu, Burlen Loring, Kesheng Wu, E. Wes Bethel, "HDF5 As a Vehicle for In Transit Data Movement", ISAV 19, Pages: 39-43, 2019, doi: 10.1145/3364228.3364237