Amir Kamil is a Computer Systems Engineer in the Computer Languages and Systems Software group. He is also a Lecturer in Electrical Engineering and Computer Science at the University of Michigan. His research interests include compiler analysis, optimization, and programming models for high-performance computing. He has contributed to the Titanium programming language and has been one of the developers of the UPC++ library for high-performance computing in C++. He earned his Ph.D. in Computer Science from UC Berkeley in 2012.
Katherine Yelick, Paul Hilfinger, Susan Graham, Dan Bonachea, Jimmy Su, Amir Kamil, Kaushik Datta, Phillip Colella, and Tong Wen, "Parallel Languages and Compilers: Perspective from the Titanium Experience", The International Journal Of High Performance Computing Applications, August 1, 2007, 21(3):266-290, doi: 10.1177/1094342007078449
We describe the rationale behind the design of key features of Titanium—an explicitly parallel dialect of JavaTM for high-performance scientific programming—and our experiences in building applications with the language. Specifically, we address Titanium’s Partitioned Global Address Space model, SPMD parallelism support, multi-dimensional arrays and array-index calculus, memory management, immutable classes (class-like types that are value types rather than reference types), operator overloading, and generic programming. We provide an overview of the Titanium compiler implementation, covering various parallel analyses and optimizations, Titanium runtime technology and the GASNet network communication layer. We summarize results and lessons learned from implementing the NAS parallel benchmarks, elliptic and hyperbolic solvers using Adaptive Mesh Refinement, and several applications of the Immersed Boundary method.
John Bachan, Dan Bonachea, Paul H Hargrove, Steven Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian Van Straalen, Scott Baden, "The UPC++ PGAS library for exascale computing", PAW 2017: 2nd Annual PGAS Applications Workshop - Held in conjunction with SC 2017, November 12, 2017, doi: 10.1145/3144779.3169108
We describe UPC++ V1.0, a C++11 library that supports APGAS programming. UPC++ targets distributed data structures where communication is irregular or fine-grained. The key abstractions are global pointers, asynchronous programming via RPC, and futures. Global pointers incorporate ownership information useful in optimizing for locality. Futures capture data readiness state, are useful for scheduling and also enable the programmer to chain operations to execute asynchronously as high-latency dependencies become satisfied, via continuations. The interfaces for moving non-contiguous data and handling memories with different optimal access methods are composable and closely resemble those used in modern C++. Communication in UPC++ runs at close to hardware speeds by utilizing the low-overhead GASNet-EX communication library.
D. Ozog, A. Kamil, Y. Zheng, P. Hargrove, J. Hammond, A. Malony, W.A. de Jong, K. Yelick, "A Hartree-Fock Application using UPC++ and the New DArray Library", 30th IEEE International Parallel & Distributed Processing Symposium (IPDPS), May 23, 2016, doi: 10.1109/IPDPS.2016.108
Hongzhang Shan, Samuel Williams, Yili Zheng, Amir Kamil, Katherine Yelick, "Implementing High-Performance Geometric Multigrid Solver With Naturally Grained Messages", 9th International Conference on Partitioned Global Address Space Programming Models (PGAS), September 2015,
- Download File: pgas15-hpgmg.pdf (pdf: 803 KB)
Hongzhang Shan, Amir Kamil, Samuel Williams, Yili Zheng, Katherine Yelick, "Evaluation of PGAS Communication Paradigms with Geometric Multigrid", 8th International Conference on Partitioned Global Address Space Programming Models (PGAS), October 2014, doi: 10.1145/2676870.2676874
- Download File: PGAS14-miniGMG.pdf (pdf: 1.2 MB)
Amir Kamil, Yili Zheng, Katherine Yelick, "A Local-View Array Library for Partitioned Global Address Space C++ Programs", ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming, June 2014,
Multidimensional arrays are an important data structure in many scientific applications. Unfortunately, built-in support for such arrays is inadequate in C++, particularly in the distributed setting where bulk communication operations are required for good performance. In this paper, we present a multidimensional library for partitioned global address space (PGAS) programs, supporting the one-sided remote access and bulk operations of the PGAS model. The library is based on Titanium arrays, which have proven to provide good productivity and performance. These arrays provide a local view of data, where each rank constructs its own portion of a global data structure, matching the local view of execution common to PGAS programs and providing maximum flexibility in structuring global data. Unlike Titanium, which has its own compiler with array-specific analyses, optimizations, and code generation, we implement multidimensional arrays solely through a C++ library. The main goal of this effort is to provide a library-based implementation that can match the productivity and performance of a compiler-based approach. We implement the array library as an extension to UPC++, a C++ library for PGAS programs, and we extend Titanium arrays with specializations to improve performance. We evaluate the array library by porting four Titanium benchmarks to UPC++, demonstrating that it can achieve up to 25% better performance than Titanium without a significant increase in programmer effort.
Yili Zheng, Amir Kamil, Michael B. Driscoll, Hongzhang Shan, Katherine Yelick, "UPC++: A PGAS Extension for C++", International Parallel and Distributed Processing Symposium (IPDPS), May 2014,
Amir Kamil, Katherine Yelick, "Hierarchical Computation in the SPMD Programming Model", 26th International Workshop on Languages and Compilers for Parallel Computing, September 2013,
Large-scale parallel machines are programmed mainly with the single program, multiple data (SPMD) model of parallelism. While this model has advantages of scalability and simplicity, it does not fit well with divide-and-conquer parallelism or hierarchical machines that mix shared and distributed memory. In this paper, we define the recursive single program, multiple data model (RSPMD) that extends SPMD with a hierarchical team mechanism to support hierarchical algorithms and machines. We implement this model in the Titanium language and describe how to eliminate a class of deadlocks by ensuring alignment of collective operations. We present application case studies evaluating the RSPMD model, showing that it enables divide-and-conquer algorithms such as sorting to be elegantly expressed and that team collective operations increase performance of conjugate gradient by up to a factor of two. The model also facilitates optimizations for hierarchical machines, improving scalability of particle in cell by 8x and performance of sorting and a stencil code by up to 40% and 14%, respectively.
Amir Kamil, Katherine Yelick, "Enforcing Textual Alignment of Collectives Using Dynamic Checks", 22nd International Workshop on Languages and Compilers for Parallel Computing, October 2009,
Many parallel programs are written in a single-program, multipledata (SPMD) style, in which synchronization is provided using collective operations that all threads execute simultaneously. If these operations are not properly aligned on all threads, deadlock can occur, and many compiler analyses and optimizations that depend on proper alignment fail. In this paper, we discuss the flaws in the Titanium language’s type system for enforcing textual alignment of collectives. We then present a system that uses runtime checks to ensure alignment for two definitions of textual alignment. The system instruments the code to keep track of alignment in each thread and then checks that alignment matches prior to performing a collective operation. We have implemented the system in the Titanium compiler, verifying that it catches alignment errors. We tested its performance on multiple application programs, demonstrating that the checks have no appreciable impact on execution time.
Amir Kamil, Katherine Yelick, "Hierarchical Pointer Analysis for Distributed Programs", The 14th International Static Analysis Symposium (SAS 2007), August 2007,
We present a new pointer analysis for use in shared memory programs running on hierarchical parallel machines. The analysis is motivated by the partitioned global address space languages, in which programmers have control over data layout and threads and can directly read and write to memory associated with other threads. Titanium, UPC, Co-Array Fortran, X10, Chapel, and Fortress are all examples of such languages. The novelty of our analysis comes from the hierarchical machine model used, which captures the increasingly hierarchical nature of modern parallel machines. For example, the analysis can distinguish between pointers that can reference values within a thread, within a shared memory multiprocessor, or within a network of processors. The analysis is presented with a formal type system and operational semantics, articulating the various ways in which pointers can be used within a hierarchical machine model. The hierarchical analysis has several applications, including race detection, sequential consistency enforcement, and software caching. We present results of an implementation of the analysis, applying it to data race detection, and show that the hierarchical analysis is very effective at reducing the number of false races detected.
Katherine Yelick, Dan Bonachea, Wei-Yu Chen, Phillip Colella, Kaushik Datta, Jason Duell, Susan L. Graham, Paul Hargrove, Paul Hilfinger, Parry Husbands, Costin Iancu, Amir Kamil, Rajesh Nishtala, Jimmy Su, Michael Welcome, Tong Wen, "Productivity and Performance Using Partitioned Global Address Space Languages", Parallel Symbolic Computation (PASCO'07), July 2007, doi: 10.1145/1278177.1278183
Partitioned Global Address Space (PGAS) languages combine the programming convenience of shared memory with the locality and performance control of message passing. One such language, Unified Parallel C (UPC) is an extension of ISO C defined by a consortium that boasts multiple proprietary and open source compilers. Another PGAS language, Titanium, is a dialect of Java T M designed for high performance scientific computation. In this paper we describe some of the highlights of two related projects, the Titanium project centered at U.C. Berkeley and the UPC project centered at Lawrence Berkeley National Laboratory. Both compilers use a source-to-source strategy that translates the parallel languages to C with calls to a communication layer called GASNet. The result is portable highperformance compilers that run on a large variety of shared and distributed memory multiprocessors. Both projects combine compiler, runtime, and application efforts to demonstrate some of the performance and productivity advantages to these languages.
Amir Kamil, Jimmy Su, Katherine Yelick, "Making Sequential Consistency Practical in Titanium", SC '05 Proceedings of the 2005 ACM/IEEE conference on Supercomputing, November 2005,
The memory consistency model in shared memory parallel programming controls the order in which memory operations performed by one thread may be observed by another. The most natural model for programmers is to have memory accesses appear to take effect in the order specified in the original program. Language designers have been reluctant to use this strong semantics, called sequential consistency, due to concerns over the performance of memory fence instructions and related mechanisms that guarantee order. In this paper, we provide evidence for the practicality of sequential consistency by showing that advanced compiler analysis techniques are sufficient to eliminate the need for most memory fences and enable high-level optimizations. Our analyses eliminated over 97% of the memory fences that were needed by a naive implementation, accounting for 87 to 100% of the dynamically encountered fences in all but one benchmark. The impact of the memory model and analysis on runtime performance depends on the quality of the optimizations: more aggressive optimizations are likely to be invalidated by a strong memory consistency semantics. We consider two specific optimizations pipelining of bulk memory copies and communication aggregation and scheduling for irregular accesses and show that our most aggressive analysis is able to obtain the same performance as the relaxed model when applied to two linear algebra kernels. While additional work on parallel optimizations and analyses is needed, we believe these results provide important evidence on the viability of using a simple memory consistency model without sacrificing performance.
Amir Kamil, Katherine Yelick, "Concurrency Analysis for Parallel Programs with Textually Aligned Barriers", Proceedings of the 18th International Workshop on Languages and Compilers for Parallel Computing, October 2005,
A fundamental problem in the analysis of parallel programs is to de- termine when two statements in a program may run concurrently. This analysis is the parallel analog to control flow analysis on serial programs and is useful in detecting parallel programming errors and as a precursor to semantics-preserving code transformations. We consider the problem of analyzing parallel programs that access shared memory and use barrier synchronization, specifically those with textually aligned barriers and single-valued expressions. We present an intermediate graph representation for parallel programs and an efficient interprocedural analysis algorithm that conservatively computes the set of all concurrent statements. We improve the precision of this algorithm by using context-free language reachability to ignore infeasible program paths. We then apply the algorithms to static race detection and show that it can benefit from the concurrency information provided.
Amir Kamil, Managing Hierarchy with Teams in the SPMD Programming Model, Workshop on Programming Abstractions for Data Locality (PADAL'14), April 28, 2014,
The single program, multiple data (SPMD) model of parallelism is the dominant programming model for large-scale distributed-memory machines. Its simple structure maps well to such machines: it exposes the actual degree of available parallelism, leads to good locality, and can be implemented by efficient runtime systems. However, its simplicity also makes it difficult to manage hierarchy, both at the algorithmic level (e.g. divide-and-conquer algorithms) and in addressing the communication characteristics of hierarchical machines. In this talk, we present a hierarchical team mechanism that allows SPMD programs to manage hierarchy. We show that it allows divide-and-conquer algorithms such as sorting to be expressed in SPMD and that it enables optimizations for hierarchical machines, increasing the scalability and/or performance of multiple benchmarks. We also explore how hierarchical teams may prove useful in other programming abstractions, such as expressing hierarchical distribution of data.
Amir Kamil, Katherine Yelick, Three Challenges and Three Solutions for Exascale Computing, NSF Workshop on Research Directions in the Principles of Parallel Computing, June 2012,
Modern high performance machines look increasingly different from those in the past. They are more hierarchical, with non-uniform memory access within a node and even within a single socket, resulting in a wider range of communication costs. They consist of heterogeneous computational elements, providing different performance and capabilities at different energy costs. Fault-tolerance is a growing concern due to a trade-off between failure rates and power use at the chip level, combined with a growing number of components in large scale systems. In this talk, we discuss three approaches to these challenges, focusing on machine hierarchy. The first is to expose the problem directly to the user in the programming model, and we present the hierarchical partitioned global address space (HPGAS) and recursive single-program, multiple-data (RSPMD) models that do so for machine hierarchy. Other solutions include using compiler analysis to automatically tackle the problem and building domain-specific libraries that hide it from the application programmer. We briefly discuss the latter two approaches, as well as some open questions in handling the three problems of hierarchy, heterogeneity, and resilience.
John Bachan, Scott Baden, Dan Bonachea, Paul Hargrove, Steven Hofmeyr, Khaled Ibrahim, Mathias Jacquelin, Amir Kamil, Brian van Straalen, "UPC++ Programmer’s Guide, v1.0-2018.3.0", Lawrence Berkeley National Laboratory Tech Report, March 31, 2018, LBNL 2001136, doi: 10.2172/1430693
UPC++ is a C++11 library that provides Partitioned Global Address Space (PGAS) programming. It is designed for writing parallel programs that run efficiently and scale well on distributed-memory parallel computers. The PGAS model is single program, multiple-data (SPMD), with each separate thread of execution (referred to as a rank, a term borrowed from MPI) having access to local memory as it would in C++. However, PGAS also provides access to a global address space, which is allocated in shared segments that are distributed over the ranks. UPC++ provides numerous methods for accessing and using global memory. In UPC++, all operations that access remote memory are explicit, which encourages programmers to be aware of the cost of communication and data movement. Moreover, all remote-memory access operations are by default asynchronous, to enable programmers to write code that scales well even on hundreds of thousands of cores.
John Bachan, Scott Baden, Dan Bonachea, Paul Hargrove, Steven Hofmeyr, Khaled Ibrahim, Mathias Jacquelin, Amir Kamil, Bryce Lelbach, Brian van Straalen,, "UPC++ Specification v1.0, Draft 6", Lawrence Berkeley National Laboratory Tech Report, March 26, 2018, LBNL 2001135, doi: 10.2172/1430689
UPC++ is a C++11 library providing classes and functions that support Partitioned Global Address Space (PGAS) programming. We are revising the library under the auspices of the DOE’s Exascale Computing Project, to meet the needs of applications requiring PGAS support. UPC++ is intended for implementing elaborate distributed data structures where communication is irregular or fine-grained. The UPC++ interfaces for moving non-contiguous data and handling memories with different optimal access methods are composable and similar to those used in conventional C++. The UPC++ programmer can expect communication to run at close to hardware speeds. The key facilities in UPC++ are global pointers, that enable the programmer to express ownership information for improving locality, one-sided communication, both put/get and RPC, futures and continuations. Futures capture data readiness state, which is useful in making scheduling decisions, and continuations provide for completion handling via callbacks. Together, these enable the programmer to chain together a DAG of operations to execute asynchronously as high-latency dependencies become satisfied.
John Bachan, Scott Baden, Dan Bonachea, Paul Hargrove, Steven Hofmeyr, Khaled Ibrahim, Mathias Jacquelin, Amir Kamil, Brian van Straalen, "UPC++ Programmer’s Guide, v1.0-2017.9", Lawrence Berkeley National Laboratory Tech Report, September 29, 2017, LBNL 2001065, doi: 10.2172/1398522
This document has been superseded by: UPC++ Programmer’s Guide, v1.0-2018.3.0 (LBNL-2001136)
UPC++ is a C++11 library that provides Asynchronous Partitioned Global Address Space (APGAS) programming. It is designed for writing parallel programs that run efficiently and scale well on distributed-memory parallel computers. The APGAS model is single program, multiple-data (SPMD), with each separate thread of execution (referred to as a rank, a term borrowed from MPI) having access to local memory as it would in C++. However, APGAS also provides access to a global address space, which is allocated in shared segments that are distributed over the ranks. UPC++ provides numerous methods for accessing and using global memory. In UPC++, all operations that access remote memory are explicit, which encourages programmers to be aware of the cost of communication and data movement. Moreover, all remote-memory access operations are by default asynchronous, to enable programmers to write code that scales well even on hundreds of thousands of cores.
John Bachan, Scott Baden, Dan Bonachea, Paul Hargrove, Steven Hofmeyr, Khaled Ibrahim, Mathias Jacquelin, Amir Kamil, Bryce Lelbach, Brian van Straalen,, "UPC++ Specification v1.0, Draft 4", Lawrence Berkeley National Laboratory Tech Report, September 27, 2017, LBNL 2001066, doi: 10.2172/1398521
This document has been superseded by: UPC++ Specification v1.0, Draft 6 (LBNL-2001135)
UPC++ is a C++11 library providing classes and functions that support Asynchronous Partitioned Global Address Space (APGAS) programming. We are revising the library under the auspices of the DOE’s Exascale Computing Project, to meet the needs of applications requiring PGAS support. UPC++ is intended for implementing elaborate distributed data structures where communication is irregular or fine-grained. The UPC++ interfaces for moving non-contiguous data and handling memories with different optimal access methods are composable and similar to those used in conventional C++. The UPC++ programmer can expect communication to run at close to hardware speeds. The key facilities in UPC++ are global pointers, that enable the programmer to express ownership information for improving locality, one-sided communication, both put/get and RPC, futures and continuations. Futures capture data readiness state, which is useful in making scheduling decisions, and continuations provide for completion handling via callbacks. Together, these enable the programmer to chain together a DAG of operations to execute asynchronously as high-latency dependencies become satisfied.
Adrian Tate, Amir Kamil, Anshu Dubey, Armin Größlinger, Brad Chamberlain, Brice Goglin, Carter Edwards, Chris J. Newburn, David Padua, Didem Unat, Emmanuel Jeannot, Frank Hannig, Gysi Tobias, Hatem Ltaief, James Sexton, Jesus Labarta, John Shalf, Karl Fuerlinger, Kathryn O’Brien, Leonidas Linardakis, Maciej Besta, Marie-Christine Sawley, Mark Abraham, Mauro Bianco, Miquel Pericàs, Naoya Maruyama, Paul Kelly, Peter Messmer, Robert B. Ross, Romain Cledat, Satoshi Matsuoka, Thomas Schulthess, Torsten Hoefler, Vitus Leung, "Programming Abstractions for Data Locality", 2014 Workshop on Programming Abstractions for Data Locality, April 29, 2014,
The goal of the workshop and this report is to identify common themes and standardize concepts for locality-preserving abstractions for exascale programming models. Current software tools are built on the premise that computing is the most expensive component, we are rapidly moving to an era that computing is cheap and massively parallel while data movement dominates energy and performance costs. In order to respond to exascale systems (the next generation of high performance computing systems), the scientific computing community needs to refactor their applications to align with the emerging data-centric paradigm. Our applications must be evolved to express information about data locality. Unfortunately current programming environments offer few ways to do so. They ignore the incurred cost of communication and simply rely on the hardware cache coherency to virtualize data movement. With the increasing importance of task-level parallelism on future systems, task models have to support constructs that express data locality and affinity. At the system level, communication libraries implicitly assume all the processing elements are equidistant to each other. In order to take advantage of emerging technologies, application developers need a set of programming abstractions to describe data locality for the new computing ecosystem. The new programming paradigm should be more data centric and allow to describe how to decompose and how to layout data in the memory.
Fortunately, there are many emerging concepts such as constructs for tiling, data layout, array views, task and thread affinity, and topology aware communication libraries for managing data locality. There is an opportunity to identify commonalities in strategy to enable us to combine the best of these concepts to develop a comprehensive approach to expressing and managing data locality on exascale programming systems. These programming model abstractions can expose crucial information about data locality to the compiler and runtime system to enable performance-portable code. The research question is to identify the right level of abstraction, which includes techniques that range from template libraries all the way to completely new languages to achieve this goal.
Dan Bonachea, Paul Hilfinger, Kaushik Datta, David Gay, Susan Graham, Amir Kamil, Ben Liblit, Geoff Pike, Jimmy Su, Katherine Yelick, "Titanium Language Reference Manual, Version 2.20", University of California, Berkeley Tech Report (UCB/EECS-2005-15.1), August 3, 2006, doi: 10.25344/S4H59R
The Titanium language is a Java dialect for high-performance parallel scientific computing. Titanium’s differences from Java include multi-dimensional arrays, an explicitly parallel SPMD model of computation with a global address space, a form of value class, and zone-based memory management. This reference manual describes the differences between Titanium and Java.
Single Program, Multiple Data Programming for Hierarchical Computations, Amir Kamil, PhD, August 2012,
As performance gains in sequential programming have stagnated due to power constraints, parallel computing has become the primary tool for increasing performance. Parallel computing has long been used in scientific computing, and programmers of the future will likely face many of the same challenges that occur in programming large-scale machines. One such challenge is that of hierarchy: machines are built in a hierarchical fashion, with a wide range of communication costs between different parts of a machine, and applications such as divide-and-conquer algorithms often have hierarchical structure. Large-scale parallel machines are programmed primarily with the single program, multiple data (SPMD) model of parallelism. This model combines independent threads of execution with global collective communication and synchronization operations. Previous work has demonstrated the advantages of SPMD over other models: its simplicity enables productive programming and avoids many classes of parallel errors, and at the same time it is easy to implement and amenable to compiler analysis and optimization. Its local-view execution model allows programmers to take advantage of data locality, resulting in good performance and scalability on large-scale machines. However, it is a flat model that does not fit well with hierarchical machines or algorithms. In this dissertation, we introduce the recursive single program, multiple data (RSPMD) execution model. This model extends SPMD with hierarchical, structured teams, or groupings of threads. We design RSPMD extensions for the Titanium language, including a hierarchical team data structure and lexically-scoped constructs for operating over teams. We demonstrate that these extensions prevent erroneous use of teams that would result in deadlock. In addition, we present a runtime mechanism for ensuring proper use of both global collective operations and collectives over teams, eliminating more potential sources of deadlock. As analyzable as SPMD is, we demonstrate that RSPMD can also be analyzed precisely and efficiently. We define a hierarchical pointer analysis for determining which data a pointer can reference, as well as on which threads the referenced data may reside. We then present a series of analyses for computing the set of concurrent statements in both SPMD and RSPMD programs. We show that these analyses improve the results of multiple client analyses, including data-locality and sharing inference, race detection, and memory-model enforcement. Finally, we present application case studies demonstrating the expressiveness and performance of the RSPMD model. We show that the model enables divide-and-conquer algorithms such as sorting to be elegantly expressed, and that team collective operations increase performance of a conjugate gradient benchmark by up to a factor of two. The model also facilitates optimizations for hierarchical machines, improving scalability of a particle in cell application by 8x, performance of sorting by up to 40%, and execution time of a stencil code by as much as 14%.
Analysis of Partitioned Global Address Space Programs, Amir Kamil, M.S., December 2006,
The introduction of multi-core processors by the major microprocessor vendors has brought parallel programming into the mainstream. Analysis of parallel languages is critical both for safety and optimization purposes. In this report, we consider the specific case of languages with barrier synchronization and global address space abstractions. Two of the fundamental problems in the analysis of parallel programs are to determine when two statements in a program can execute concurrently, and what data can be referenced by each memory location. We present an efficient interprocedural analysis algorithm that conservatively computes the set of all concurrent statements, and improve its precision by using context-free language reachability to ignore infeasible program paths. In addition, we describe a pointer analysis using a hierarchical machine model, which distinguishes between pointers that can reference values within a thread, within a shared memory multiprocessor, or within a network of processors. We then apply the analyses to two clients, data race detection and memory model enforcement. Using a set of five benchmarks, we show that both clients benefit significantly from the analyses.