CORVETTE: Correctness Verification and Testing of Parallel Programs
The goal of this project is to provide tools to assess the correctness of parallel programs written using hybrid parallelism. There is a dire lack of both theoretical and engineering know-how in the area of finding bugs in hybrid or large-scale parallel programs, which our research aims to change. As intra-node programming is likely to be the most disrupted by the transition to exascale, we will emphasize support for a large spectrum of programming and execution models, such as dynamic tasking, directive-based programming, and data parallelism. For inter-node programming we plan to handle both one-sided (PGAS) and two-sided (MPI) communication paradigms.
We aim to provide tools that identify sources of nondeterminism in program executions and make concurrency bugs (data races, atomicity violations, deadlocks) and floating-point behavior reproducible. To increase the adoption of automatic bug-finding and exploration tools, novel techniques to increase precision and scalability are required. Precision implies that false alarms (false positives) are filtered out and only real problems are reported to users. During our research we will explore state-of-the-art methods that use dynamic program analysis. Since dynamic analysis monitors the program execution, the resulting tools are precise, at the expense of scalability. Current approaches exhibit 10X–100X runtime overhead; our goal is to provide precise tools with no more than 2X runtime overhead at large scale. We will also research techniques to maximize tool efficacy on a time budget, e.g. no more than 10% overhead. These principles have been incorporated into the UPC-Thrille data race detection tool for the Unified Parallel C programming language.
We will also research novel approaches to assist with program debugging using the minimal amount of concurrency needed to reproduce a bug, and to support two-level debugging of high-level programming abstractions (DSLs). Furthermore, we plan to apply the combination of techniques developed for bug finding to provide an environment for exploratory programming. We will develop tools that allow developers to specify their intentions (not implementations) for code transformations and that provide feedback about correctness. Besides code transformations, we plan to allow for automatic algorithmic tuning, i.e. transparently choosing at runtime the best implementation with respect to a metric from a collection of algorithms. As an initial case study we have developed PRECIMONIOUS, a tool that determines program phases where double floating-point precision can be replaced by single precision.
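The double-vs-single tradeoff that PRECIMONIOUS searches over can be illustrated with a toy sketch (this is not the tool itself; `to_single` and `harmonic_sum` are hypothetical helpers, with single precision emulated via a 32-bit pack/unpack round trip):

```python
import struct

def to_single(x: float) -> float:
    # Round an IEEE double to the nearest single-precision value
    # via a 32-bit pack/unpack round trip.
    return struct.unpack('f', struct.pack('f', x))[0]

def harmonic_sum(n: int, single: bool = False) -> float:
    # Sum 1 + 1/2 + ... + 1/n, optionally rounding every term and
    # every partial sum to single precision.
    s = 0.0
    for i in range(1, n + 1):
        term = 1.0 / i
        if single:
            s = to_single(s + to_single(term))
        else:
            s += term
    return s

d = harmonic_sum(100_000)
f = harmonic_sum(100_000, single=True)
print(f"double: {d!r}  single: {f!r}  |diff| = {abs(d - f):.2e}")
```

Lowering precision here changes the result only in the trailing digits; whether that loss is acceptable, versus the speedup single precision buys on real hardware, is exactly the accuracy/performance question a tuning tool must answer per program phase.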
Progress 2015
This document summarizes the project progress and status in 2015.
Researchers

James Demmel (UCB)
Cindy Rubio-Gonzalez (UCB)
Costin Iancu (LBL)
Anh Cuong Nguyen (UCB)
Koushik Sen (UCB), Principal Investigator
Publications
2016
Cuong Nguyen, Cindy Rubio-Gonzalez, Benjamin Mehne, Koushik Sen, Costin Iancu, James Demmel, William Kahan, Wim Lavrijsen, David H. Bailey, David Hough, "Floating-Point Precision Tuning Using Blame Analysis", 38th International Conference on Software Engineering (ICSE 2016), May 14, 2016, doi: 10.1145/2884781.2884850
While tremendously useful, automated techniques for tuning the precision of floating-point programs face important scalability challenges. We present Blame Analysis, a novel dynamic approach that speeds up precision tuning. Blame Analysis executes floating-point instructions using different levels of accuracy for their operands. The analysis determines the precision of all operands such that a given precision is achieved in the final result of the program. Our evaluation on ten scientific programs shows that Blame Analysis is successful in lowering operand precision. As it executes the program only once, the analysis is particularly useful when targeting reductions in execution time. In that case, the analysis needs to be combined with search-based tools such as Precimonious. Our experiments show that combining Blame Analysis with Precimonious leads to better results with a significant reduction in analysis time: the optimized programs execute faster (in three cases, we observe as much as 39.9% program speedup), and the combined analysis time is 9× faster on average, and up to 38× faster, than Precimonious alone.
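The per-instruction question Blame Analysis asks can be sketched as follows (an illustrative simplification, not the paper's implementation; `round_to` and `minimal_precisions` are hypothetical names, with single precision emulated by a 32-bit round trip):

```python
import operator
import struct

def round_to(x: float, prec: str) -> float:
    # 'single': round a double to the nearest 32-bit float; 'double': keep as is.
    if prec == 'single':
        return struct.unpack('f', struct.pack('f', x))[0]
    return x

def minimal_precisions(op, a, b, threshold):
    # For one instruction op(a, b), find the lowest operand precisions
    # whose result stays within `threshold` of the all-double result,
    # trying cheaper (lower-precision) assignments first.
    exact = op(a, b)
    for pa, pb in [('single', 'single'), ('single', 'double'),
                   ('double', 'single'), ('double', 'double')]:
        if abs(op(round_to(a, pa), round_to(b, pb)) - exact) <= threshold:
            return pa, pb

# A loose accuracy target lets both operands drop to single precision;
# a tight one forces them back up to double.
print(minimal_precisions(operator.add, 0.1, 0.2, 1e-6))
print(minimal_precisions(operator.add, 0.1, 0.2, 1e-12))
```

Answering this once per executed instruction, in a single run, is what lets the analysis avoid the repeated program executions a pure search-based tuner needs.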
2015
Xuehai Qian, Koushik Sen, Paul Hargrove, Costin Iancu, "OPR: Partial Deterministic Record and Replay for One-Sided Communication", LBNL TR, April 17, 2015,
James Demmel, Costin Iancu, Koushik Sen, "Corvette Progress Report 2015", April 1, 2015,
Milind Chabbi, Wim Lavrijsen, Wibe de Jong, Koushik Sen, John Mellor-Crummey, Costin Iancu, "Barrier Elision for Production Parallel Programs", PPoPP 2015, February 5, 2015,
2014
James Demmel, Hong Diep Nguyen, "Parallel Reproducible Summation", IEEE Transactions on Computers, Special Section on Computer Arithmetic 2014, August 11, 2014, doi: 10.1109/TC.2014.2345391
Reproducibility, i.e. getting bitwise identical floating-point results from multiple runs of the same program, is a property that many users depend on, either for debugging or for correctness checking, in many codes [10]. However, the combination of dynamic scheduling of parallel computing resources and floating-point non-associativity makes attaining reproducibility a challenge even for simple reduction operations like computing the sum of a vector of numbers in parallel. We propose a technique for floating-point summation that is reproducible independent of the order of summation. Our technique uses Rump's algorithm for error-free vector transformation [7], and is much more efficient than using (possibly very) high precision arithmetic. Our algorithm reproducibly computes highly accurate results with an absolute error bound of (formula) at a cost of 7n FLOPs and a small constant amount of extra memory. Higher accuracies are also possible by increasing the number of error-free transformations. As long as all operations are performed in round-to-nearest mode, results computed by the proposed algorithms are reproducible for any run on any platform. In particular, our algorithm requires the minimum number of reductions, i.e. one reduction of an array of six double-precision floating-point numbers per sum, and hence is well suited for massively parallel environments.
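Methods like the one above rest on error-free transformations. As background (this is the classic TwoSum transformation attributed to Knuth, a standard building block, not the paper's 7n-FLOP algorithm itself):

```python
def two_sum(a: float, b: float):
    # Knuth's TwoSum: returns (s, t) with s = fl(a + b) and
    # s + t == a + b exactly, so t is the rounding error of the add.
    s = a + b
    b_virtual = s - a
    a_virtual = s - b_virtual
    t = (a - a_virtual) + (b - b_virtual)
    return s, t

# Floating-point addition is not associative, which is what makes
# dynamically scheduled parallel sums non-reproducible:
print((1e16 + 1.0) + 1.0 == 1e16 + (1.0 + 1.0))   # False

# TwoSum recovers the error that plain addition discards:
s, t = two_sum(1e16, 1.0)
print(s, t)   # 1e+16 1.0
```

Because the lost rounding error is captured exactly in `t`, a summation algorithm can carry it along and fold it back in deterministically, rather than letting it vanish in an order-dependent way.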
2013
Cindy Rubio-Gonzalez, Cuong Nguyen, Hong Diep Nguyen, James Demmel, William Kahan, Koushik Sen, David H. Bailey, Costin Iancu, David Hough, "Precimonious: Tuning Assistant for Floating-Point Precision", Supercomputing 2013, November 17, 2013, doi: 10.1145/2503210.2503296
Given the variety of numerical errors that can occur, floating-point programs are difficult to write, test and debug. One common practice employed by developers without an advanced background in numerical analysis is using the highest available precision. While more robust, this can degrade program performance significantly. In this paper we present Precimonious, a dynamic program analysis tool to assist developers in tuning the precision of floating-point programs. Precimonious performs a search over the types of the floating-point program variables, trying to lower their precision subject to accuracy constraints and performance goals. Our tool recommends a type instantiation that uses lower precision while producing an accurate enough answer without causing exceptions. We evaluate Precimonious on several widely used functions from the GNU Scientific Library, two NAS Parallel Benchmarks, and three other numerical programs. For most of the programs analyzed, Precimonious reduces precision, which results in performance improvements as high as 41%.
Chang-Seo Park, Koushik Sen, Costin Iancu, "Scaling Data Race Detection for Partitioned Global Address Space Programs", International Conference on Supercomputing (ICS) 2013, 2013, doi: 10.1145/2464996.2465000
Contemporary and future programming languages for HPC promote hybrid parallelism and shared memory abstractions using a global address space. In this programming style, data races occur easily and are notoriously hard to find. Existing state-of-the-art data race detectors exhibit 10X–100X performance degradation and do not handle hybrid parallelism. In this paper we present the first complete implementation of data race detection at scale for UPC programs. Our implementation tracks local and global memory references in the program, and it uses two techniques to reduce the overhead: 1) hierarchical function- and instruction-level sampling; and 2) exploiting the runtime persistence of aliasing and locality specific to Partitioned Global Address Space applications. The results indicate that both techniques are required in practice: well-optimized instruction sampling introduces overheads as high as 6500% (a 65X slowdown), while each technique in isolation is able to reduce it only to 1000% (a 10X slowdown). When the optimizations are applied in conjunction, our tool finds all previously known data races in our benchmark programs with at most 50% overhead when running on 2048 cores. Furthermore, while previous results illustrate the benefits of function-level sampling, our experience shows that this technique does not work for scientific programs: instruction sampling or a hybrid approach is required.
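To make the detection problem concrete, here is a deliberately tiny Eraser-style lockset checker over a recorded access trace (an illustrative sketch only, not UPC-Thrille's algorithm, which also handles PGAS-specific aliasing and sampling; `lockset_races` is a hypothetical name):

```python
def lockset_races(trace):
    # trace: iterable of (thread_id, address, locks_held) events.
    # Lockset discipline: a location touched by more than one thread
    # while the running intersection of held locks is empty is flagged
    # as a potential data race.
    locksets = {}   # address -> intersection of locksets seen so far
    threads = {}    # address -> set of threads that accessed it
    races = []
    for tid, addr, locks in trace:
        locksets.setdefault(addr, set(locks))
        locksets[addr] &= set(locks)
        threads.setdefault(addr, set()).add(tid)
        if len(threads[addr]) > 1 and not locksets[addr] and addr not in races:
            races.append(addr)
    return races

trace = [
    (1, 'x', {'L'}),   # thread 1 accesses x holding lock L
    (2, 'x', {'L'}),   # thread 2 also holds L: consistently protected
    (1, 'y', {'L'}),
    (2, 'y', set()),   # no common lock on y across threads
]
print(lockset_races(trace))   # flags 'y' only
```

The cost of checking every memory reference this way is precisely the overhead problem the paper attacks: sampling decides which references enter such a checker at all.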
James Demmel, Hong Diep Nguyen, "Fast Reproducible Floating-Point Summation", Proceedings of the 21st IEEE Symposium on Computer Arithmetic (ARITH'13), April 10, 2013, doi: 10.1109/ARITH.2013.9
Reproducibility, i.e. getting bitwise identical floating-point results from multiple runs of the same program, is a property that many users depend on, either for debugging or for correctness checking, in many codes [1]. However, the combination of dynamic scheduling of parallel computing resources and floating-point non-associativity makes attaining reproducibility a challenge even for simple reduction operations like computing the sum of a vector of numbers in parallel. We propose a technique for floating-point summation that is reproducible independent of the order of summation. Our technique uses Rump's algorithm for error-free vector transformation [2], and is much more efficient than using (possibly very) high precision arithmetic. Our algorithm trades off efficiency and accuracy: we reproducibly attain reasonably accurate results (with an absolute error bound c · n^2 · macheps · max_i |v_i| for a small constant c) with just 2n + O(1) floating-point operations, and quite accurate results (with an absolute error bound c · n^3 · macheps^2 · max_i |v_i|) with 5n + O(1) floating-point operations, both with just two reduction operations. Higher accuracies are also possible by increasing the number of error-free transformations. As long as the same rounding mode is used, results computed by the proposed algorithms are reproducible for any run on any platform.
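The core pre-rounding idea can be sketched in a few lines (my simplification keeping a single extraction level, not the paper's full multi-level algorithm with its stated error bounds; `reproducible_leading_sum` is a hypothetical name):

```python
import math
import random

def reproducible_leading_sum(v):
    # Pre-round every summand against a common power-of-two boundary M so
    # that the extracted parts are all multiples of the same unit; adding
    # such parts incurs no rounding error, so the result cannot depend on
    # the order of summation.
    m = max((abs(x) for x in v), default=0.0)
    if m == 0.0:
        return 0.0
    # Boundary large enough that len(v) extracted parts cannot overflow it.
    M = 2.0 ** (math.ceil(math.log2(m)) + math.ceil(math.log2(len(v))) + 2)
    total = 0.0
    for x in v:
        q = (M + x) - M   # ExtractScalar: x rounded to the grid of ulp(M)
        total += q        # exact: no rounding error accumulates
    return total

random.seed(0)
v = [random.uniform(-1.0, 1.0) for _ in range(1000)]
forward = reproducible_leading_sum(v)
backward = reproducible_leading_sum(list(reversed(v)))
print(forward == backward)   # True: bitwise identical in any order
```

The single extraction sacrifices the low-order bits of each summand for order independence; repeating the extraction on the residuals `x - q`, as the paper does, recovers accuracy while preserving reproducibility.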
James Demmel, Hong Diep Nguyen, Numerical Accuracy and Reproducibility at Exascale, Proceedings of the 21st IEEE Symposium on Computer Arithmetic (ARITH'13), April 10, 2013,
Chang-Seo Park, Koushik Sen, Costin Iancu, "Scaling Data Race Detection for Partitioned Global Address Space Programs", Principles and Practice of Parallel Programming (PPoPP 2013), March 4, 2013,
2012
"Corvette Handout - X-Stack Kick-Off 2012", 2012,
"Corvette Project Description - X-Stack Kick-Off", 2012,