DEGAS: Dynamic Exascale Global Address Space Programming Environments
The Dynamic, Exascale Global Address Space programming environment (DEGAS) project will develop the next generation of programming models, runtime systems and tools to meet the challenges of Exascale systems. We will develop a new set of programming concepts based on a hierarchical model of parallelism and data locality, hierarchical fault containment and recovery for resilience, and introspective dynamic resource management; demonstrate them using extensions to existing languages; and evaluate their utility for applications. Our solution will address the following key challenges posed by exascale systems:
- Scalability: Efficient communication (extended GASNet) and synchronization mechanisms, combined with compiler (ROSE) and runtime optimizations to minimize both communication and synchronization.
- Programmability: Rich set of programming constructs based on a dynamic, resilient Partitioned Global Address Space (PGAS) model, demonstrated in multiple language dialects (C and FORTRAN).
- Performance Portability: Non-invasive profiling (IPM), deep code analysis (ROSE) and a dynamically Adaptive RunTime System (ARTS).
- Resilience: Containment Domains, state capture mechanisms, and lightweight, asynchronous recovery mechanisms.
- Energy Efficiency: Runtime energy adaptation and communication-optimal code generation.
- Interoperability: Runtime and language interoperability with MPI, OpenMP and libraries (Lithe).
The DEGAS team will work with Co-Design centers to drive the programming construct design, combined with information about hardware platforms as it emerges. We will also leverage ongoing discussions with other application and vendor stakeholders as well as mainstream language standards groups, augmented with advisory committees and semi-annual retreats involving broad representation from all three groups.
Our approach focuses on a vertically integrated programming and execution environment that incorporates the latest algorithmic approaches and application structures to effectively service ultra-scale science and energy applications. The primary focus areas of DEGAS are shown in Figure 1 along with the proposed integrated software stack.
- Katherine Yelick, Principal Investigator (LBNL)
- Krste Asanović (UC Berkeley)
- James Demmel (UC Berkeley)
- Mattan Erez (UT Austin)
- Paul Hargrove (LBNL)
- Steven Hofmeyr (LBNL)
- Costin Iancu (LBNL)
- Khaled Ibrahim (LBNL)
- John Mellor-Crummey (Rice University)
- Leonid Oliker (LBNL)
- Dan Quinlan (LLNL)
- Eric Roman (LBNL)
- Vivek Sarkar (Rice University)
- Erich Strohmaier (LBNL)
- Yili Zheng (LBNL)
Nathan Zhang, Michael Driscoll, Armando Fox, Charles Markley, Samuel Williams, Protonu Basu, "Snowflake: A Lightweight Portable Stencil DSL", High-level Parallel Programming Models and Supportive Environments (HIPS), May 2017
- Download File: hips17-snowflake.pdf (pdf: 475 KB)
Khaled Z. Ibrahim, Evgeny Epifanovsky, Samuel Williams, Anna I. Krylov, "Cross-scale efficient tensor contractions for coupled cluster computations through multiple programming model backends", Journal of Parallel and Distributed Computing (JPDC), February 2017, doi: 10.1016/j.jpdc.2017.02.010
E Georganas, M Ellis, R Egan, S Hofmeyr, A Buluç, B Cook, L Oliker, K Yelick, "MerBench: PGAS benchmarks for high performance genome assembly", Proceedings of PAW 2017: 2nd Annual PGAS Applications Workshop - Held in conjunction with SC 2017: The International Conference for High Performance Computing, Networking, Storage and Analysis, 2017, 2017-Jan:1--4, doi: 10.1145/3144779.3169109
M Ellis, E Georganas, R Egan, S Hofmeyr, A Buluç, B Cook, L Oliker, K Yelick, "Performance characterization of de novo genome assembly on leading parallel systems", Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2017, 10417 LN:79--91, doi: 10.1007/978-3-319-64203-1_6
Nicholas Chaimov, Khaled Z. Ibrahim, Samuel Williams, Costin Iancu, "Reaching Bandwidth Saturation Using Transparent Injection Parallelization", International Journal of High Performance Computing Applications (IJHPCA), November 2016, doi: 10.1177/1094342016672720
Khaled Z. Ibrahim, Evgeny Epifanovsky, Samuel Williams, Anna I. Krylov, "Cross-scale Efficient Tensor Contractions for Coupled Cluster Computations Through Multiple Programming Model Backends (tech report version)", LBNL. - Report Number: LBNL-1005853, July 1, 2016, LBNL 1005853, doi: 10.2172/1274416
H Shan, S Williams, Y Zheng, W Zhang, B Wang, S Ethier, Z Zhao, "Experiences of Applying One-Sided Communication to Nearest-Neighbor Communication", Proceedings of PAW 2016: 1st PGAS Applications Workshop (PAW), January 2016, 17--24, doi: 10.1109/PAW.2016.008
- Download File: PAW16-stencil.pdf (pdf: 601 KB)
Hongzhang Shan, Samuel Williams, Yili Zheng, Amir Kamil, Katherine Yelick, "Implementing High-Performance Geometric Multigrid Solver with Naturally Grained Messages", 9th International Conference on Partitioned Global Address Space Programming Models (PGAS), September 2015, 38--46, doi: 10.1109/PGAS.2015.12
- Download File: pgas15-hpgmg.pdf (pdf: 803 KB)
Evangelos Georganas, Aydin Buluç, Jarrod Chapman, Leonid Oliker, Daniel Rokhsar, Katherine Yelick, "MerAligner: A Fully Parallel Sequence Aligner", IEEE 29th International Parallel and Distributed Processing Symposium (IPDPS), May 2015, 561--570, doi: 10.1109/IPDPS.2015.96
Aligning a set of query sequences to a set of target sequences is an important task in bioinformatics. In this work we present merAligner, a highly parallel sequence aligner that implements a seed-and-extend algorithm and employs parallelism in all of its components. MerAligner relies on a high performance distributed hash table (seed index) and uses the one-sided communication capabilities of Unified Parallel C (UPC) to facilitate fine-grained parallelism. We leverage communication optimizations at the construction of the distributed hash table and software caching schemes to reduce communication during the aligning phase. Additionally, merAligner preprocesses the target sequences to extract properties enabling exact sequence matching with minimal communication. Finally, we efficiently parallelize the I/O intensive phases and implement an effective load balancing scheme. Results show that merAligner exhibits efficient scaling up to thousands of cores on a Cray XC30 supercomputer using real human and wheat genome data while significantly outperforming existing parallel alignment tools.
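To make the seed-and-extend idea concrete, the following is a minimal serial sketch in Python. It is not merAligner's implementation: the function names (`build_seed_index`, `extend`, `align`), the ungapped extension, and the in-memory dictionary standing in for the distributed hash table are all assumptions chosen for illustration.

```python
from collections import defaultdict

def build_seed_index(targets, k):
    """Map every length-k substring (seed) to the (target_id, offset) pairs where it occurs."""
    index = defaultdict(list)
    for tid, seq in enumerate(targets):
        for off in range(len(seq) - k + 1):
            index[seq[off:off + k]].append((tid, off))
    return index

def extend(query, target, q_off, t_off, k):
    """Grow an exact seed match left and right while characters keep matching (no gaps)."""
    left = 0
    while (q_off - left > 0 and t_off - left > 0
           and query[q_off - left - 1] == target[t_off - left - 1]):
        left += 1
    right = k
    while (q_off + right < len(query) and t_off + right < len(target)
           and query[q_off + right] == target[t_off + right]):
        right += 1
    return q_off - left, t_off - left, left + right

def align(query, targets, index, k):
    """Return the distinct (target_id, query_start, target_start, length) matches."""
    hits = set()
    for q_off in range(len(query) - k + 1):
        # Seed lookup: in a distributed setting this is the step that would touch remote memory.
        for tid, t_off in index.get(query[q_off:q_off + k], ()):
            q_start, t_start, length = extend(query, targets[tid], q_off, t_off, k)
            hits.add((tid, q_start, t_start, length))
    return hits

# Hypothetical toy data, not taken from the paper.
targets = ["ACGTACGTGA", "TTGACCA"]
index = build_seed_index(targets, k=4)
hits = align("GTACGTG", targets, index, k=4)
print(sorted(hits))
```

In this toy version every seed lookup is a local dictionary access; the abstract's point is that with a distributed seed index those lookups become one-sided remote reads, which is why caching and exact-match preprocessing matter.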
Scott French, Yili Zheng, Barbara Romanowicz, Katherine Yelick, "Parallel Hessian Assembly for Seismic Waveform Inversion Using Global Updates", International Parallel and Distributed Processing Symposium (IPDPS), May 2015, 753--762, doi: 10.1109/IPDPS.2015.58
We present the design and evaluation of a distributed matrix-assembly abstraction for large-scale inverse problems in HPC environments: namely, physics-based Hessian estimation in full-waveform seismic inversion at the scale of the entire globe. Our solution to this data-assimilation problem relies on UPC++, a new PGAS extension to the C++ language, to implement one-sided asynchronous updates to distributed matrix elements, and allows us to tackle inverse problems well beyond our previous capabilities. Our evaluation includes scaling results for Hessian estimation on up to 12,288 cores, typical of current production scientific runs and next-generation inversions. We also present comparisons with an alternative implementation based on MPI-3 remote memory access (RMA) operations, focusing on performance and code complexity. Interoperability between UPC++ and other parallel programming tools (e.g. MPI, OpenMP) allowed for incremental adoption of the PGAS model where most beneficial. Further, we note that this model of asynchronous assembly can generalize to other data-assimilation applications that accumulate updates into shared global state.
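The pattern of accumulating updates into shared global state can be illustrated with a single-process Python simulation. This is only a sketch under stated assumptions: the `DistributedMatrix` class, its row-block ownership rule, and the explicit `flush` step are hypothetical and do not reproduce the UPC++ or MPI-3 RMA interfaces used in the paper.

```python
from collections import defaultdict

class DistributedMatrix:
    """Toy stand-in for a matrix whose rows are block-partitioned across ranks."""

    def __init__(self, n_rows, n_cols, n_ranks):
        self.rows_per_rank = -(-n_rows // n_ranks)  # ceiling division
        # Each simulated rank stores only the rows it owns (row -> list of values).
        self.local = [defaultdict(lambda: [0.0] * n_cols) for _ in range(n_ranks)]
        # pending[src][dst] holds updates issued by src that have not yet reached dst.
        self.pending = [defaultdict(list) for _ in range(n_ranks)]

    def owner(self, row):
        return row // self.rows_per_rank

    def update(self, src_rank, row, col, value):
        """Queue an accumulate; the issuing rank does not wait for the owner."""
        self.pending[src_rank][self.owner(row)].append((row, col, value))

    def flush(self):
        """Deliver all queued updates. Addition commutes, so delivery order does not matter."""
        for outbox in self.pending:
            for dst, updates in outbox.items():
                for row, col, value in updates:
                    self.local[dst][row][col] += value
            outbox.clear()

    def get(self, row, col):
        return self.local[self.owner(row)][row][col]

m = DistributedMatrix(n_rows=4, n_cols=2, n_ranks=2)
m.update(src_rank=0, row=3, col=1, value=2.0)   # rank 0 contributes to a row owned by rank 1
m.update(src_rank=1, row=3, col=1, value=0.5)   # rank 1 contributes to the same element
m.update(src_rank=1, row=0, col=0, value=1.0)   # rank 1 contributes to a row owned by rank 0
m.flush()
print(m.get(3, 1), m.get(0, 0))
```

Because every update is an addition, contributions from different ranks to the same element can arrive in any order, which is the property that lets the assembly proceed without the owner participating in each update.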
Nicholas Chaimov, Khaled Ibrahim, Samuel Williams, Costin Iancu, "Exploiting Communication Concurrency on High Performance Computing Systems", IJHPCA, April 17, 2015
- Download File: thorserv2.pdf (pdf: 1.7 MB)
Costin Iancu, Nicholas Chaimov, Khaled Z. Ibrahim, Samuel Williams, "Exploiting Communication Concurrency on High Performance Computing Systems", Programming Models and Applications for Multicores and Manycores (PMAM), February 2015
- Download File: pmam15-servers.pdf (pdf: 1.2 MB)
Milind Chabbi, Wim Lavrijsen, Wibe de Jong, Koushik Sen, John Mellor-Crummey, Costin Iancu, "Barrier Elision for Production Parallel Programs", PPOPP 2015, February 5, 2015
- Download File: nwbar.pdf (pdf: 663 KB)
E Georganas, A Buluç, J Chapman, S Hofmeyr, C Aluru, R Egan, L Oliker, D Rokhsar, K Yelick, "HipMer: An extreme-scale de novo genome assembler", International Conference for High Performance Computing, Networking, Storage and Analysis, SC, January 1, 2015, 15-20-No, doi: 10.1145/2807591.2807664
M Chabbi, W Lavrijsen, W De Jong, K Sen, J Mellor-Crummey, C Iancu, "Barrier elision for production parallel programs", Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP, January 1, 2015, 2015-Jan:109--119, doi: 10.1145/2688500.2688502
Evangelos Georganas, Aydin Buluç, Jarrod Chapman, Leonid Oliker, Daniel Rokhsar, Katherine Yelick, "Parallel de Bruijn Graph Construction and Traversal for de Novo Genome Assembly", International Conference for High Performance Computing, Networking, Storage and Analysis (SC), November 16, 2014, 437--448, doi: 10.1109/SC.2014.41
- Download File: sc14genome.pdf (pdf: 719 KB)
Vivek Kumar, Yili Zheng, Vincent Cavé, Zoran Budimlic, Vivek Sarkar, "HabaneroUPC++: a Compiler-free PGAS Library", 8th International Conference on Partitioned Global Address Space Programming Models (PGAS), October 2014
Hongzhang Shan, Amir Kamil, Samuel Williams, Yili Zheng, Katherine Yelick, "Evaluation of PGAS Communication Paradigms with Geometric Multigrid", Proceedings of the 8th International Conference on Partitioned Global Address Space Programming Models (PGAS), October 2014, doi: 10.1145/2676870.2676874
- Download File: PGAS14-miniGMG.pdf (pdf: 1.2 MB)
Partitioned Global Address Space (PGAS) languages and one-sided communication enable application developers to select the communication paradigm that balances the performance needs of applications with the productivity desires of programmers. In this paper, we evaluate three different one-sided communication paradigms in the context of geometric multigrid using the miniGMG benchmark. Although miniGMG's static, regular, and predictable communication does not exploit the ultimate potential of PGAS models, multigrid solvers appear in many contemporary applications and represent one of the most important communication patterns. We use UPC++, a PGAS extension of C++, as the vehicle for our evaluation, though our work is applicable to any of the existing PGAS languages and models. We compare performance with the highly tuned MPI baseline, and the results indicate that the most promising approach towards achieving performance and ease of programming is to use high-level abstractions, such as the multidimensional arrays provided by UPC++, that hide data aggregation and messaging in the runtime library.
Amir Kamil, Yili Zheng, Katherine Yelick, "A Local-View Array Library for Partitioned Global Address Space C++ Programs", ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming (ARRAY), June 2014
Multidimensional arrays are an important data structure in many scientific applications. Unfortunately, built-in support for such arrays is inadequate in C++, particularly in the distributed setting where bulk communication operations are required for good performance. In this paper, we present a multidimensional library for partitioned global address space (PGAS) programs, supporting the one-sided remote access and bulk operations of the PGAS model. The library is based on Titanium arrays, which have proven to provide good productivity and performance. These arrays provide a local view of data, where each rank constructs its own portion of a global data structure, matching the local view of execution common to PGAS programs and providing maximum flexibility in structuring global data. Unlike Titanium, which has its own compiler with array-specific analyses, optimizations, and code generation, we implement multidimensional arrays solely through a C++ library. The main goal of this effort is to provide a library-based implementation that can match the productivity and performance of a compiler-based approach. We implement the array library as an extension to UPC++, a C++ library for PGAS programs, and we extend Titanium arrays with specializations to improve performance. We evaluate the array library by porting four Titanium benchmarks to UPC++, demonstrating that it can achieve up to 25% better performance than Titanium without a significant increase in programmer effort.
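As an illustration of the local view described above, here is a small one-dimensional Python sketch in which each rank builds its own piece of a global grid, indexes it by global coordinates, and bulk-copies over the intersection of two index ranges. The `LocalViewArray` class and its `copy_from` method are hypothetical names for this sketch; they are not the UPC++ or Titanium array interface, and the two "ranks" here live in one process.

```python
class LocalViewArray:
    """One rank's piece of a 1-D grid, indexed by global coordinates in [lo, hi)."""

    def __init__(self, lo, hi, fill=0):
        self.lo, self.hi = lo, hi
        self.data = [fill] * (hi - lo)

    def _offset(self, i):
        if not self.lo <= i < self.hi:
            raise IndexError(f"global index {i} outside [{self.lo}, {self.hi})")
        return i - self.lo

    def __getitem__(self, i):
        return self.data[self._offset(i)]

    def __setitem__(self, i, value):
        self.data[self._offset(i)] = value

    def copy_from(self, other, lo=None, hi=None):
        """Bulk-copy where both index ranges overlap, optionally restricted to [lo, hi)."""
        lo = max(self.lo, other.lo, self.lo if lo is None else lo)
        hi = min(self.hi, other.hi, self.hi if hi is None else hi)
        if lo < hi:
            self.data[lo - self.lo:hi - self.lo] = other.data[lo - other.lo:hi - other.lo]

# Two simulated ranks: each owns four cells plus one ghost cell facing its neighbor.
left = LocalViewArray(0, 5)    # owns 0..3, ghost cell at 4
right = LocalViewArray(3, 8)   # owns 4..7, ghost cell at 3
for i in range(0, 4):
    left[i] = i
for i in range(4, 8):
    right[i] = 10 * i

# Ghost exchange expressed as restricted bulk copies rather than explicit messages.
left.copy_from(right, 4, 5)
right.copy_from(left, 3, 4)
print(left.data, right.data)
```

Each piece carries its own global bounds, so the copy can work out the overlapping region itself; the caller only names the region of interest instead of packing and unpacking buffers.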
Yili Zheng, Amir Kamil, Michael B. Driscoll, Hongzhang Shan, Katherine Yelick, "UPC++: A PGAS extension for C++", International Parallel and Distributed Processing Symposium (IPDPS), May 19, 2014, 1105--1114, doi: 10.1109/IPDPS.2014.115
Partitioned Global Address Space (PGAS) languages are convenient for expressing algorithms with large, random-access data, and they have proven to provide high performance and scalability through lightweight one-sided communication and locality control. While very convenient for moving data around the system, PGAS languages have taken different views on the model of computation, with the static Single Program Multiple Data (SPMD) model providing the best scalability. In this paper we present UPC++, a PGAS extension for C++ that has three main objectives: 1) to provide an object-oriented PGAS programming model in the context of the popular C++ language, 2) to add useful parallel programming idioms unavailable in UPC, such as asynchronous remote function invocation and multidimensional arrays, to support complex scientific applications, 3) to offer an easy on-ramp to PGAS programming through interoperability with other existing parallel programming systems (e.g., MPI, OpenMP, CUDA). We implement UPC++ with a "compiler-free" approach using C++ templates and runtime libraries. We borrow heavily from previous PGAS languages and describe the design decisions that led to this particular set of language features, providing significantly more expressiveness than UPC with very similar performance characteristics. We evaluate the programmability and performance of UPC++ using five benchmarks on two representative supercomputers, demonstrating that UPC++ can deliver excellent performance at large scale up to 32K cores while offering PGAS productivity features to C++ applications.
Samuel Williams, "At Exascale, Will Bandwidth Be Free?", DOE ModSim Workshop, 2013
- Download File: modsim2013SWWilliams.pdf (pdf: 408 KB)
Michael Garland, Manjunath Kudlur, Yili Zheng, "Designing a Unified Programming Model for Heterogeneous Machines", Supercomputing (SC), November 2012, doi: 10.1109/SC.2012.48
While high-efficiency machines are increasingly embracing heterogeneous architectures and massive multithreading, contemporary mainstream programming languages reflect a mental model in which processing elements are homogeneous, concurrency is limited, and memory is a flat undifferentiated pool of storage. Moreover, the current state of the art in programming heterogeneous machines tends towards using separate programming models, such as OpenMP and CUDA, for different portions of the machine. Both of these factors make programming emerging heterogeneous machines unnecessarily difficult. We describe the design of the Phalanx programming model, which seeks to provide a unified programming model for heterogeneous machines. It provides constructs for bulk parallelism, synchronization, and data placement which operate across the entire machine. Our prototype implementation is able to launch and coordinate work on both CPU and GPU processors within a single node, and by leveraging the GASNet runtime, is able to run across all the nodes of a distributed-memory machine.
Mads Kristensen, Yili Zheng, Brian Vinter, "PGAS for Distributed Numerical Python Targeting Multi-core Clusters", IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2012, doi: 10.1109/IPDPS.2012.67
In this paper we propose a parallel programming model that combines two well-known execution models: Single Instruction, Multiple Data (SIMD) and Single Program, Multiple Data (SPMD). The combined model supports SIMD-style data parallelism in global address space and supports SPMD-style task parallelism in local address space. One of the most important features in the combined model is that data communication is expressed by global data assignments instead of message passing. We implement this combined programming model in Python, making parallel programming with Python both highly productive and high-performing on distributed memory multi-core systems. We base the SIMD data parallelism on DistNumPy, an auto-parallelizing version of the Numerical Python (NumPy) package that allows sequential NumPy programs to run on distributed memory architectures. We implement the SPMD task parallelism as an extension to DistNumPy that enables each process to have direct access to the local part of a shared array. To harvest the multi-core benefits in modern processors we exploit multi-threading in both SIMD and SPMD execution models. The multi-threading is completely transparent to the user: it is implemented in the runtime with OpenMP and by using multi-threaded libraries when available. We evaluate the implementation of the combined programming model with several scientific computing benchmarks using two representative multi-core distributed memory systems, an Intel Nehalem cluster with InfiniBand interconnects and a Cray XE-6 supercomputer, on up to 1536 cores. The benchmarking results demonstrate good, scalable performance.