# **Scaling Data Race Detection for Partitioned Global Address Space Programs**

Chang-Seo Park, Koushik Sen **EECS** Department University of California, Berkeley {parkcs,ksen}@cs.berkeley.edu

Costin Iancu Computational Research Division Lawrence Berkeley National Laboratory cciancu@lbl.gov

### **ABSTRACT**

Contemporary and future programming languages for HPC promote hybrid parallelism and shared memory abstractions using a global address space. In this programming style, data races occur easily and are notoriously hard to find. Existing state-of-the-art data race detectors exhibit  $10 \times -100 \times$  performance degradation and do not handle hybrid parallelism. In this paper we present the first complete implementation of data race detection at scale for UPC programs. Our implementation tracks local and global memory references in the program and it uses two techniques to reduce the overhead: 1) hierarchical function and instruction level sampling; and 2) exploiting the runtime persistence of aliasing and locality specific to Partitioned Global Address Space applications. The results indicate that both techniques are required in practice: well optimized instruction sampling introduces overheads as high as 6500% (65× slowdown), while each technique in separation is able to reduce it only to 1000% (10× slowdown). When applying the optimizations in conjunction our tool finds all previously known data races in our benchmark programs with at most 50% overhead when running on 2048 cores. Furthermore, while previous results illustrate the benefits of function level sampling, our experiences show that this technique does not work for scientific programs: instruction sampling or a hybrid approach is required.

# **Categories and Subject Descriptors**

D.2.5 [Software]: Software Engineering—Testing and Debugging

### **Keywords**

Data Race, Tracing, Sampling, Instrumentation Overhead

### INTRODUCTION

Attaining good performance and efficacy on contemporary and future large scale High Performance Computing systems requires using hybrid programming models: OpenMP+MPI, UPC+MPI, Intel TBB + MPI or OpenMP+UPC. With multiple levels, intra-node parallelism is usually exploited

Copyright 2013 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the U.S. Government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only. ICS'13, June 10-14, 2013, Eugene, Oregon, USA.

Copyright 2013 ACM 978-1-4503-2130-3/13/06 ...\$10.00.

using shared memory programming models, while inter-node parallelism is exploited using message passing or shared memory abstractions.

Bugs due to non-deterministic execution and conflicting memory accesses are fairly common and notoriously hard to detect in a parallelism rich environment. Previous work demonstrates the ability of dynamic program analyses to find concurrency bugs (data race [33], atomicity violations [22], deadlock [7]) in shared memory programs. Dynamic program analyses have been also used to find heisenbugs in distributed memory programs: DAMPI [39] for MPI wildcard receives and UPC-Thrille [29] for data races in Unified Parallel C [10].

Data race detectors for shared memory programming [31, 33] trace individual memory references (load/store instructions) and reason about program semantics using a centralized analysis. The implementations are optimized to reduce the instrumentation overhead and reportedly slow down the execution by less than 10× at small concurrency. Bug finding for distributed memory programming models is made scalable by using a distributed analysis, but the current approaches illustrated by DAMPI and UPC-Thrille [29, 39] track only calls into communication libraries. Tools that can handle hybrid programming have not been demonstrated yet. Furthermore, while acceptable when testing programs on workstations, the current overhead of dynamic program analyses is hard to stomach at the contemporary HPC concurrencies of tens of thousands of cores. Large scale analyses face the additional challenge to provide the lowest achievable overhead while still providing good coverage.

In this paper we present the first complete dynamic analysis for distributed memory programs able to track both memory references and communication calls. Our main design goal is attaining very low overhead at scale. We extend the UPC-Thrille data race detection tool with tracking of individual memory references and validate it using implementations of the NAS Parallel Benchmarks [6], as well as other fine-grained dynamic programming and tree search applications. UPC-Thrille, detailed in Section 2, implements a dynamic program analysis where the program is executed with additional instrumentation and data about memory accesses, communication and synchronization operations is gathered and analyzed.

Performance degradation caused by instrumentation is a well recognized challenge for dynamic program analyses. The most widely used technique to reduce overhead is sampling [3. 4, 23, 17, 39, 29] of the program execution. Tools for shared memory use instruction level sampling while the distributed

memory tools [39, 29] implement its equivalent by sampling the communication operations. For shared memory, Marino et al [23] recently introduced LiteRace which coarsens the granularity of the sampling at function boundaries. LiteRace showed better scalability and coverage than instruction level sampling when applied on several Microsoft programs, as well as Apache and Firefox.

We have experimented with both instruction level sampling and function level sampling. Contrary to LiteRace, the results presented in Section 3 indicate that instruction level sampling performs better than function level sampling for scientific programs. Instruction level sampling adds runtime overhead as high as  $65\times$  while many runs using function level sampling did not terminate, even when instrumenting only the first execution of a function.

The scalability and efficacy of our tool is determined solely by the instrumentation overhead. We have performed experiments to determine a "default" sampling frequency that provides both coverage and low performance degradation. Our results indicate that manipulating the sampling frequency is not enough to attain acceptable slowdown on the workload under consideration.

We propose a combination of techniques to minimize the tool overhead. In Section 4.1 we describe how to use program semantic information such as aliasing to prune the number of events traced. In Section 4.2 we propose a hierarchical sampling approach where the instrumentation granularity is dynamically controlled both at the function level and at the instruction level. Applying each technique in isolation is able to reduce the maximum tool overhead from  $65\times$  with instruction sampling to roughly  $10\times$ . When combining aliasing based pruning with hierarchical sampling, we were able to reduce the maximum overhead from  $10\times$  (i.e. 1000%) to only 50% while finding the same races using 2048 cores. We believe that our findings are widely applicable to any tool for data race detection in Partitioned Global Address Space languages: Chapel, Titanium, Co-Array Fortran. This paper makes the following contributions:

- We develop the first complete dynamic analysis for distributed memory programs which is able to track both memory references and communication calls with significantly low overhead.
- We show that state-of-the-art sampling techniques for data race detection do not scale for scientific programs.
- We propose a hierarchical sampling technique which is able to reduce instrumentation overhead from 65× to 10×.
- We propose a novel aliasing based pruning heuristics, which when combined with hierarchical sampling, further reduces overhead from  $10 \times$  to  $0.5 \times$ .

# 2. UNIFIED PARALLEL C

UPC is a parallel extension of the ISO C programming language for high performance computing. UPC uses the Single-Program-Multiple-Data (SPMD) programming model and provides a Partitioned Global Address Space: each task has access to a private address space and to a global shared address space. The language extends the C type system with the qualifier of pointer-to-shared to denote accesses to the global address space. Pointers-to-

shared can be casted to proper C pointers, but not viceversa. This is widely used in practice for performance reasons and for calling into native libraries such as Intel MKL. In addition, the language provides synchronization primitives (lock, barrier), bulk memory transfers (memput, memget), as well as a memory consistency [41, 21] model.

Together with Chapel, X10, Co-Array Fortran and Titanium, UPC belongs to the family of Partitioned Global Address Space languages. These PGAS languages distinguish between local and global references and provide support for logical data layouts, such as block-cyclic array distribution. As a result, they implement complicated memory management and a reference to a global object is orders of magnitude slower [18] than a load/store instruction due to complex addressing rules.

# 2.1 Finding Data Races with UPC-Thrille

A data race occurs in parallel programs when two threads access the same memory location with no ordering constraints between them, and at least one of the accesses is a write [1, 27].

In our previous work, we developed a data race detector for UPC called UPC-Thrille. UPC-Thrille was able to find only data races between accesses performed using variables with the pointer-to-shared data type (e.g. shared int \*) and communication calls (e.g. memget). It instruments communication calls and implements an active testing [19] methodology which works in two phases:

- A predictive analysis phase which uses a distributed lock-set-based algorithm [33, 29] to identify data races. The data races reported by the predictive analysis could be real or infeasible. The results reported show that UPC-Thrille is able to predict potential data races with good scalability and overhead lower than 15%.
- A confirmation phase, where the program is reexecuted under a controlled schedule that attempts to make the potential races manifest, i.e. create a state by manipulating the schedule where two threads are about to access the same memory location and at least one of the accesses is a write. Races that are not reproduced are unlikely to be false positives.

Well optimized UPC programs usually cast pointers-to-shared (e.g. shared int \*) to C proper pointers (e.g. int\*) and the released UPC-Thrille cannot detect data races introduced by memory aliases. Furthermore, the presence on non-blocking communication operations [8] introduces another class of data races. As non-blocking communication is a "background" asynchronous activity that can be overlapped with computation, memory accesses within a task can race with the communication operations initiated by the same task. A complete data race detection solution needs to track all the memory references, including those using C pointers, as well as communication calls. Therefore, our released UPC-Thrille was prone to false-negatives, i.e. it could miss to report real data races during an execution.

In the rest of this paper we discuss the implementation and optimizations required for the complete UPC-Thrille data race detector. The tool is publicly available under a BSD license.

# 3. THE OVERHEAD OF DATA RACE DETECTION

Instrumentation overhead is recognized as a problem that dynamic race detectors have to address. Commercial tools for C programs such as the Intel Thread Checker or the Sun Thread Analyzer, usually instrument all load/store instructions at the expense of  $600\times$  execution slowdown [36] on scientific OpenMP programs with small memory footprints. Average overheads on other scientific programs for the Intel Thread Checker have been reported [32] around  $200\times$  and as high as  $485\times$ .

Sampling techniques have been introduced by Arnold and Ryder [3] and later adopted in other bug finding tools [4, 23] for parallel programs. The efficacy of these techniques is determined by the granularity of the instrumented code region and the sampling strategy. Tools [4] for finding bugs in programs running on managed runtimes (e.g. Java) tend to use instruction level sampling; the additional instrumentation overhead is not perceived as unacceptable since the runtime already manages object metadata and access. These systems usually observe up to  $3X \times$  slowdowns for non-scientific applications and data is not available for HPC applications. Distributed memory tools such as DAMPI [39] sample only communication calls.

Recently, Marino et al [23] proposed a technique to coarsen the sampling control from instruction level to function level. They use a compiler to generate instrumented and un-instrumented versions of functions and select the appropriate copy at runtime. The instrumented version of a function monitors every memory reference during its execution. Their LiteRace tool introduces up to 3× overhead while providing good coverage on non-scientific programs; it has not been evaluated on scientific programs. In the rest of this paper we refer to this technique as function sampling. One reason that function sampling outperforms instruction sampling is that it amortizes better the cost of tracking memory references: function sampling executes one branch/decision per application level function call while instruction sampling executes one branch/decision per instruction traced.

Several sampling strategies have been proposed and evaluated for non-scientific programs. Random sampling has been shown to provide poor coverage. SWAT [17] detects memory leaks and uses an approach where the execution of code segments is sampled at a rate inversely proportional to their execution frequency. LiteRace uses a bursty sampler, where the execution of a function is sampled initially at a 100% rate and the sample rate is progressively reduced until it reaches a lower bound. Both approaches try to give priority to regions of code rarely executed and give priority to the first execution of any code region.

The implementation of UPC-Thrille described in [29] instruments memget/memput communication operations with a bursty sampling similar to LiteRace.

In order to provide a complete data race detection solution we have modified UPC-Thrille and the Berkeley UPC compiler to track all memory references, including all references through C proper pointers. We provide a well optimized implementation of instruction sampling that makes extensive use of C macro-definitions to eliminate function call overheads for the instrumentation code. Every memory reference is examined using a bursty sampling strategy.

We have also implemented function sampling with the same bursty strategy.

For any sampled memory reference, the implementation has to check whether the address resides within a thread's private address space or within the global address space. This requires integration with the UPC runtime memory management module and it is an expensive operation, common to PGAS languages. References to the private address space are ignored as they cannot race. Global references are inserted into the UPC-Thrille internal data structures and further checked against other references.

We distinguish three types of overhead: 1) instrumentation overhead is introduced by the checks to prune the non-interesting data accesses; 2) computation overhead is introduced by the operations on internal data structures to manage the interesting accesses; and 3) communication overhead introduced by the exchange of conflicting accesses between tasks. Thus, private references contribute only to instrumentation overhead while global references also contribute to the computation and communication overhead.

#### 3.1 Benchmarks

We evaluate UPC benchmarks using fine-grained and bulk communication. Table 1 presents statistics about the benchmark characteristics. For implementations using bulk communication primitives we use the NAS Parallel Benchmarks (NPB) [25, 26, 6], releases 2.4 and 3.3. We have performed experiments with the problem classes A, B and C and D; overall, the memory footprint of the workload varies from tens of MBs to tens of GBs. Asanović et al [5] examined six different promising domains for commercial parallel applications and report that a surprisingly large fraction of them use methods encountered in the scientific domain. In particular, all methods used in the NAS benchmarks (multigrid, sparse-matrix operations, sorting, Fast Fourier Transformation, dense linear algebra) appear in at least one commercial domain. Thus, beside their HPC relevance, these benchmarks are of interest to other communities.

The fine-grained benchmarks reflect the type of communication/synchronization that is present in larger applications during data structure initializations, dynamic load balancing, or remote event signaling. The *guppie* benchmark performs random read/modify/write accesses to a large distributed array, a common operation in parallel hash table construction, graph algorithms and data mining. The amount of work is static and evenly distributed among tasks at execution time. The *psearch* benchmark performs parallel unbalanced tree search [28]. The benchmark is designed to be evaluation tool for dynamic load balancing strategies.

The selected programs provide a good sample of different programming and software engineering styles, dynamic application behavior and scalability characteristics. The NAS benchmarks contain many function calls and have a structure common to any large application. The fine-grained benchmarks contain few or no user defined function calls, a structure common to many scientific libraries and their unit testing. In the NAS benchmarks, the ratio of local memory accesses to communication calls performed at runtime is large  $(O(10^5)$  or above), while in the fine grained benchmarks they are roughly equal. The NAS benchmarks implement iterative methods, while the code in the fine-grained benchmarks represents a "direct" solve executed only once.

Due to the randomness of the access pattern and the fre-

|         |      |            |              |         | O      | verhead |       |       |
|---------|------|------------|--------------|---------|--------|---------|-------|-------|
| Bench   | LoC  | Runtime(s) | #Races       | NL      | HA.5   | IA      | FA0   | I     |
| guppie  | 271  | 19.070     | 2(2) + 0(0)  | 54.9%   | 54.2%  | 53.7%   | DNF   | 74.9% |
| psearch | 803  | 0.697      | 3(1) + 2(2)  | 2.48%   | 10.8%  | 666%    | 8.01% | 6490% |
| BT 3.3  | 9698 | 189.48     | 7(0) + 3(1)  | 0.574%  | 1.16%  | 77.6%   | DNF   | -     |
| CG 2.4  | 1654 | 39.573     | 0(0) + 1(1)  | 1.09%   | 27.6%  | 57.6%   | DNF   | 2579% |
| EP 2.4  | 678  | 54.453     | 0(0) + 0(0)  | -0.618% | 0.805% | 2.09%   | 4.74% | 111%  |
| FT 2.4  | 2289 | 62.663     | 2(2) + 0(0)  | 0.601%  | 30.1%  | 121%    | DNF   | 2744% |
| IS 2.4  | 1426 | 5.130      | 0(0) + 0(0)  | 0.376%  | 119%   | 159%    | DNF   | 1201% |
| LU 3.3  | 6348 | 155.997    | 0(0) + 44(2) | -0.425% | -      | 75.7%   | DNF   | -     |
| MG 2.4  | 2229 | 18.687     | 2(2) + 4(0)  | 0.336%  | 176%   | 632%    | DNF   | 2020% |
| SP 3.3  | 5740 | 247.937    | 10(0) + 3(1) | 0.160%  | 0.861% | 29.1%   | DNF   | -     |

**Table 1:** Statistics for the NAS Parallel Benchmarks class C, guppie and psearch running on 16 cores. We report the races found as A(B) + C(D), where A represents the number of races detected by the original UPC-Thrille tool with B of them confirmed, and C represents the additional number of races detected with our extensions with D of them confirmed through phase II. Some execution overheads are omitted (-), due to configuration errors.

quency of execution, these benchmarks require the tool to provide good program coverage. NAS FT exhibits a race between the initialization code and the subsequent computation: initialization is executed only once. *guppie* performs random updates to a global table and it exhibits two races: read-write and write-write. *psearch* implements a work stealing strategy that exhibits random data races.

# 3.2 Comparison of Function and Instruction Sampling

The experimental results are obtained on a Cray XE6 system composed of nodes containing two twelve-core AMD MagnyCours 2.1 GHz processors. The system has two nodes attached to a Gemini network interface card, forming an overall 3-D torus network with 6,384 nodes. The network is providing a bandwidth of 9.375 GBytes/sec per direction in 10 directions. The maximum injection bandwidth per node is 20GB/s. Our implementation extends UPC-Thrille as contained in the Berkeley UPC release 2.14.2.

Table 2 describes the experiment labeling scheme and summary of results is shown in Table 1. Instruction sampling is denoted by  $\mathbf{I}$  and we use the setting of 0.9 instruction back-off factor as our baseline, as indicated in [29]. Function sampling is denoted by  $\mathbf{F}$ . For a given setting of the sampling frequency for  $\mathbf{I}$ , we are interested in determining a sampling frequency for  $\mathbf{F}$  that provides similar coverage.

We will use the behavior of the CG benchmark to illustrate the differences between the different configurations. These CG trends are representative for the whole suite of benchmarks we examined. Figure 1 presents the tool performance when applied to CG classes A and D running on 16 and 2048 cores respectively. The benchmark implements an iterative method and Class A solves a problem with a small memory footprint (MBs) in few iterations, while class D solves a large (GBs) problem. For reference, the original UPC-Thrille tool adds 8% runtime overhead when instrumenting only communication calls (labeled as **NL** in the graphs for No-Local). Our implementation finds one new race in the implementation of this benchmark when compared to the original UPC-Thrille.

Instruction level sampling  ${\bf I}$  of all memory references adds a 3600% overhead to the CG class A benchmark execution. Function level sampling  ${\bf F.5}$  introduces a 2900% overhead, lower than the 3600% overhead of  ${\bf I}$ . For problem larger than class A, function level sampling exhibits very large overheads or the execution does not finish (DNF): some runs exhaust

the available memory while some were manually terminated after observing  $1000\times$  slowdown.

A comparison of the overhead breakdown for F and I illustrates the fundamental differences between the two methods. I introduces almost all overhead (3600%) in instrumentation, with less than 3\% for computation and communication overhead combined, while **F.5** adds only 112% instrumentation overhead. This large difference validates the common intuition that function level sampling amortizes better the cost of deciding what references to track. On the other hand, F.5 exhibits a large 2800% computation overhead to record and reason about the memory references that are actually tracked. The computation overhead for I is very small at less than 2%. This behavior is explained by the temporal distribution of tracked memory accesses during the program execution. UPC-Thrille uses a combination of lock-set based and happens-before analysis that requires tracking all memory references between two barrier statements. Function level sampling exhibits a clustered behavior, where many memory references are tracked for a short period of time. Instruction sampling spreads the tracking of memory references more evenly over the program execution. Thus, the behavior of function sampling is determined by the scalability of the tool internal data structures, while the behavior of instruction sampling is determined the speed of "classifying" a memory access.

# 3.3 Instrumentation and Computation Overheads

Previous work on data race detection focuses on word-level memory accesses and require only keeping track of conflicting addresses. These tools usually use hash table data structures internally. For scientific programs with bulk communication operations (PGAS or MPI), data races on full memory ranges can occur during execution. UPC-Thrille uses an efficient Interval Skiplist [16] data structure to represent memory ranges that demonstrated good performance when sampling only communication operations.

As the performance of function sampling is clearly hampered by the internal data structure overhead, we evaluate the scalability using micro-benchmarks for the insertion and search operations. The time complexity of these algorithms is dependent on the number of elements in the data structure and the distribution of the intervals. We evaluate performance across a range of list sizes and interval distributions: sequential, reverse sequential, strided and uniform random.

| Format             | S[A][F]                                                                     |  |  |
|--------------------|-----------------------------------------------------------------------------|--|--|
| Sampling           | H: hierarchical, I: instruction-level, F: function-level                    |  |  |
|                    | NL: no instrumentation on local accesses                                    |  |  |
| $\mathbf{A}$ lias  | Indicates the use of the persistence alias heuristic                        |  |  |
| $\mathbf{F}$ actor | Back-off factor for sampling at function level                              |  |  |
| Example            | HA.5: Hierarchical sampling with alias heuristic and back-off factor of 0.5 |  |  |

**Table 2:** Key for the experiment naming scheme.





**Figure 1:** Breakdown of data race detection overhead for the CG class A benchmark running on 16 cores and class D running on 2048 cores. The **F** and **FA** configurations did not finish for the class D experiment. At the mid-point **HA.5** the probability of sampling a function invocation decays from 1 to 0, by 0.5 each time a function invocation is instrumented.

Sequential streams are often encountered in code that performs data structure initialization, and are present in all of our benchmarks. They are also the holy-grail of cache optimizations. Strided accesses occur in the Fast Fourier Transform code NAS FT, while random accesses of the form a[b[i]] appear in sparse methods of NAS CG and sorting in NAS IS, as well as *guppie*. For a real-world perspective, we also measure the average number of memory intervals that are recorded in our benchmarks.

Figures 2 and 3 present the measured performance on one core of the Cray XE6 system. For a uniform random distribution of 20,000 ranges, the average insert time is 12  $\mu$ s and the average search time is 1.3  $\mu$ s. For a more regular distribution of ranges such as a sequential one (e.g. [0, 10), [10, 20), [20, 30), ...), the insertion and search times were higher at 114  $\mu$ s and 2.4  $\mu$ s, respectively. This is a weakness of the Interval Skiplist which relies on randomness of data for balancing link levels. In practice, the effect can be easily offset by adding some irregularity, such as the implementation inserting random benign addresses with low frequency. In the application benchmark, the memory access stream does have irregularity, and as illustrated by the results for MG inserts are on average 45  $\mu$ s and searches 0.54  $\mu$ s.

When using instruction sampling for the application benchmarks, the Interval Skip-lists never grew too large. They remained at under 1000 unique ranges, thus the insert and search times of the Interval Skiplist do not contribute largely to the overhead. On the other hand, when using function sampling the data structures grew above  $10^6$  entries, at which point we stopped the execution due to the very large overheads already accumulated.

Instruction sampling pays a higher cost for classifying a memory reference but it naturally throttles the number of references recorded at any time. Function sampling performs a fast classification while having to record a large number of references. Reference classification has a constant

overhead independent of the number of references already recorded, while recording overhead scales with the number of references. This difference explains why function sampling scales worse than instruction sampling for scientific programs. For reference, when running on the Cray XE6, the average instrumentation overhead per reference is 1ns, the average memory classification is 45ns, the average computation overhead per reference is 500ns while the average communication overhead per reference is  $60\mu s$ .

# 3.4 Sampling for Scientific Applications

Our results indicate that for the workload considered function sampling is not a feasible strategy. Function sampling is indeed faster than instruction sampling for problems using small datasets, such as class A of the NAS Parallel Benchmarks. When increasing the data set size to B, C and D, function sampling in any flavor does not terminate, while the highest slowdown introduced by instruction sampling is  $65\times$ . From all benchmarks considered, the only exception happens for *psearch* and EP where **F** is roughly twice as fast than **I**. *psearch* is a tree search benchmark which performs a constant and small amount of work per function, independent of the problem size: this is a common characteristic to many commercial applications. EP is an "Embarrassingly Parallel" benchmark with no global memory accesses.

Our results contradict the intuition that function sampling scales better than instruction sampling. The performance reversal is caused by having only coarse grained control over instrumentation: as loops within scientific applications execute billions of references, function sampling tracks billions of references. In the PGAS case, this is compounded by the fact that classifying the locality of a reference is expensive.

As we have evaluated codes widely used in the scientific domain using small and large datasets running at small and large concurrency, we believe that our finding is widely applicable. In contrast, previous shared memory data race



**Figure 2:** Average time for the insert operation in Interval Skiplist.

detectors [33, 23, 31, 14] have been scaled at most up to 16 cores and on applications using small data sets. LiteRace is validated on a four core system, while the tool presented by Raman et al [31] has been scaled up to 16 cores.

# 4. TECHNIQUES TO REDUCE OVER-HEAD

As the slowdown of data race detection is caused by instrumentation, additional techniques are required when low overhead is mandatory. This can be achieved by: 1) reducing the number of events traced with better filtering techniques; and 2) better control over the instrumentation granularity.

The most obvious approach is lowering the sampling frequency. We have performed a search over this space, detailed results omitted for brevity. Our results indicate that acceptable behavior at scale cannot be attained solely by tuning the sampling frequency and additional optimizations to reduce the number of references tracked are required for scientific applications. We propose two such techniques that improve scalability without sacrificing analysis precision.

The first optimization filters events by exploiting program semantic information such as aliasing. We improve the analysis performance by exploiting the insight that *aliases are persistent* in PGAS programs: once one is created it will point in the same region (private or global) for a long period of time. Using this we can eliminate the overhead introduced by looking up the physical memory layout inside the language runtime.

The second optimization reduces overhead by providing more control over the instrumentation granularity. We use a hierarchical sampling approach that combines function and instruction sampling to amortize the instrumentation cost while retaining fine grained control over the number of events sampled.

# 4.1 Exploiting the Persistence of Locality

PGAS languages, such as UPC, Titanium, CAF, Chapel and X10, provide the abstraction of a shared memory address space. Data residing in this space is accessible through references to variables that have a particular type, e.g. "pointer-to-shared" type in UPC or "global" in Titanium<sup>1</sup>.

The memory management inside any PGAS language runtime is complex due to the need to provide globally addressable memory and to support data layouts, e.g. block cyclic



**Figure 3:** Average time for the search operation in Interval Skiplist.

layouts. Thus, references to global object and pointer arithmetic operations translate into runtime operations orders of magnitude [18] more expensive than operations on local references, through a C pointer in the UPC case. Application developers aggressively cast global references to local for performance reasons, including the need to call external libraries such as Intel MKL.

For every local memory reference, the data race detection code needs to perform the inverse up-cast operation and check whether the address is globally visible. This operation is also orders of magnitude more expensive than a regular memory load/store.

For portability reasons, we implement a runtime approach that does not require compiler support. We limit the number of runtime up-casts using the intuition that aliases/locality are persistent. Once initialized, a local reference will access only the private space or only the global space during the program execution. This assumption allows the analysis to dynamically determine the "locality" of any reference only once and cache the result for the rest of the execution. In our implementation, we add a shadow variable to cache the locality of every memory reference expression.

The persistence of locality assumption is valid in all of our test programs and it does not decrease the precision of the analysis. Eight of the ten benchmarks (except EP and guppie) have C pointers aliased to the global heap and the tool finds the same races found by an analysis that "checks" every local memory reference. These races are shown in Table 1 in the column '#Races as "+ C (D)" .

When the underlying assumption is not valid during the execution, the heuristic as implemented may lead to false negatives (miss real data races). However, as casts of global references are complicated and are implemented as runtime "calls", the technique can be trivially generalized for programs with a more dynamic behavior. Any runtime casting call can be extended to invalidate the cached locality information. For casting of proper C pointers, the compiler can easily generate the invalidation calls at the operation site. The performance of this extension is determined by the ratio of casts to memory references performed by the program at runtime. The additional overhead for realistic programs is likely to be negligible in practice

# 4.2 Hierarchical Sampling

For every memory reference there are two sources of runtime overhead. Instrumentation overhead is introduced to decide whether the reference should be recorded and compu-

<sup>&</sup>lt;sup>1</sup>Actually, in Titanium any reference is global by default and the language provides local qualifiers.

tation overhead is introduced when recording the reference in the tool internal data structures. By reducing the instruction sampling rate one can clearly reduce overhead, but at the expense of program coverage. To provide both low overhead and good coverage we propose a hierarchical sampling approach which combines the fine grained control of instruction sampling with the overhead amortization provided by function sampling. By using a good hierarchical sampling strategy, we can reduce the instrumentation overhead while retaining the ability to sample from a diverse context with less redundancy. Using the concept of code regions, we formally define instrumentation and hierarchical sampling.

DEFINITION 1 (CODE REGIONS). We inductively define code regions. By definition, the smallest unit of a code region is a memory reference (read or write). A code region is a reference or a sequence of one or more code regions. The entire program is the largest code region. Each code region R has a label, denoted as #R.

Functions, loop bodies, basic blocks etc. are examples of code regions. We assume structured code, i.e. that all code regions are properly nested.

DEFINITION 2 (REGION STACK). During program execution, a region stack RS is maintained. Similar to a call stack, when a region is entered, the label of the region #R is pushed to RS. When exiting a region, the last label is popped from the stack. At the beginning of a program execution, RS is initially empty.

Definition 3 (Instrumentation). Instrumentation is a transformation of a code region  $R \to R^{inst}$ . If R is a memory reference (base case)

$$R^{inst} = \begin{array}{c} \quad \textbf{if} \ \ check\text{-}reference(\#R :: RS) \ \textbf{then} \\ \quad \mid \quad \log(\#R) \end{array}$$

Else, if R is a sequence of regions  $[R_1, R_2, \ldots, R_n]$ ,

$$R^{inst} = \begin{array}{c} \textbf{if} \ check\text{-}region(\#R :: RS) \ \textbf{then} \\ RS = \#R :: RS; \\ [R_1^{inst}, R_2^{inst}, \dots, R_n^{inst}]; \\ RS = tail(RS) \\ \textbf{else} \\ [R_1, R_2, \dots, R_n] \end{array}$$

By specializing the *check-reference* and *check-region* and choosing the region granularity, we can implement multiple sampling algorithms. For example, instruction sampling with an exponential back-off (strategy I in the experiments presented in Section 3.2), is implemented as the following functions. The map  $p: label \to \mathbb{R}$  contains the (dynamic) sampling probabilities of regions.

```
\forall \#R \in Statements. \ p(\#R) = 1.0
check\text{-reference}(\#R :: RS) = 
if \ rand() < p(\#R) \ then
\mid p(\#R) *= BACKOFF\_FACTOR;
\mid \text{return } true
else
\mid \text{return } false
check\text{-region}(x) = true
```

Function sampling as introduced by the LiteRace [23] implementation is defined as follows. The region is a whole

function and the *sample-strategy* function depends on the strategy of sampling, such as a fixed probability, random or an adaptive strategy.

```
check-reference(x) = true

check-region\#R :: RS) = sample-strateqy(\#R)
```

Intuitively, the *check-reference* function decides what events should be logged at runtime, while the *check-region* function provides control over the granularity of these decisions. We propose a hierarchical sampling strategy that combines instruction sampling with function sampling. The combination of hierarchical sampling with the aliasing runtime heuristic is referred to as **HA** and described as:

 $\forall \#R \in Statements \cup Functions. \ p(\#R) = 1.0$ 

```
check-reference(\#R :: RS) =
if p > 0 \land rand() < p(\#R) then
   if is-local-access(R) then
      // locality persistence heuristic
      p(\#R) = 0;
      return false:
   else
      p(\#R) *= STMT\_BACKOFF\_FACTOR;
      return true;
else
return false
check-region(\#R :: RS) =
if p > 0 \land rand() < p(\#R) then
   p(\#R) *= FUNC\_BACKOFF\_FACTOR;
   return true
else
return false
```

This implementation uses exponential back-off at both individual reference and function granularity.

# 4.3 Scalable Data Race Detection

In the following, the letter **A** in the configuration name denotes composing the aliasing heuristic to that particular sampling method. We denote the configuration using hierarchical instruction sampling by **H**, where we control instrumentation at the function and the instruction level. Instructions are sampled with the baseline 0.9 value for **I**, while the numbers in the title denote the function back-off factor. Thus, **H1** is identical to **I** (always samples functions), while with **H0** we sample only the first invocation of any function. At the mid-point **H.5** the probability of sampling a function invocation decays from 1 by 0.5 each time the function is sampled; for long running programs the sampling probability converges to 0. Selected results are presented in Table 1 and Figures 1 and 4.

Composing the aliasing optimization with any of the tool instrumentation methods greatly improves performance. As illustrated in Figure 1, the overhead of instruction sampling is reduced from 3600% to 105% with **IA** for CG class A. Similar trends are observable when scaling the problem and running class D on 2048 cores. For this particular configuration, the **FA** method does not terminate due to out of memory errors or excessive slowdown. **I** exhibits a 259% overhead, while **IA** exhibits less than 15% slowdown.

Composing aliasing with function and instruction sampling does not change the overall trends. For problems using small datasets, such as class A of the NAS Parallel Bench-

marks, **F** or **FA** is faster than **I** or **IA**, respectively. When increasing the data set size to B, C and D, **FA** still does not terminate. Overall, the maximum **IA** slowdown is  $10\times$ .

Hierarchical sampling  ${\bf H}$  performs better than both instruction and function sampling. For CG,  ${\bf H.5}$  exhibits a 2550% slowdown, significantly better than both  ${\bf F}$  and  ${\bf I}$ . Overall, with hierarchical sampling we still observe slowdowns as high as  $20\times$ , which is still unacceptable when running at scale. With the hierarchical sampling approach, the instrumentation overhead contributes the most to the program slowdown.

Composing the aliasing heuristic with the hierarchical sampling strategy provides best performance. For CG, the overhead of hierarchical sampling is reduced from 2550% with **H.5** to 99% with **HA.5** and from 294% with **H0** to 17% with **HA0**. In the case of the NAS Parallel Benchmarks class C on 16 cores, the weighted average overhead for all the benchmarks with **HA.5** was 11.9%.

Figure 4 shows the performance when running strong scaling experiments for the classes C and D of the NAS Parallel Benchmarks. For all experiments, the lowest overhead is introduced by the  ${\bf HA}$  configuration and we are able to find all the races with less than 50% runtime overhead when running up to 2048 cores.

Overall, instrumentation overhead contributes the most to the slowdown caused by data race detection. The computation overhead in the scalable versions of **IA** and **HA** is small. At large scale the communication overhead is also small due to the techniques presented in [29].

# 5. BUGS FOUND

The column labeled '#Races' in Table 1 shows the data races found by our tool. The data is presented as A(B) + C(D), where A(B) denotes races found by the original UPC-Thrille and C(D) the *additional* C races found in phase I and reproduced (D) in phase II by this new implementation. All the additional races are uncovered due to the ability to track global to local memory aliases.

We detect a previously unknown race in NAS CG where memory is initialized using "local" pointers and distributed without synchronization to other tasks using global pointers. The races reproduced in *psearch*, BT, SP and LU are benign and are intended by the programmer. In *psearch* they occur on a counter used for work-stealing, while the races confirmed in BT, SP and LU occur when executing custom synchronization code:

signal 
$$(v = 1)$$
; || wait  $(while(v == 0))$ ;

The races reported as potential but not reproduced in BT, SP and LU occur between memory accesses separated by the custom synchronization code. Note that reporting infeasible data races in the presence of custom synchronization code is a common limitation of data race detection tools. Our 2nd phase tries to remove this limitation—it never confirms the feasibility of these data races. This is a possible indication that these data races are false positives.

### 6. DISCUSSION

We believe that our techniques and findings are widely applicable to other languages or programming models. The alias heuristic is applicable to any PGAS language since they provide a global address space and performance and software engineering concerns require programmers to aggressively

identify references that are local to a given "task". Hierarchical sampling is a generic technique orthogonal to the language, programming model employed in the application, or the data race detection algorithm. It is clearly required for SPMD parallelism (UPC, CAF, MPI) or OpenMP parallel loops, where work per function scales with the problem size. When using structured parallelism as present in Habanero-C or OpenMP tasking, some applications may perform a constant amount of work per function and the overall behavior approximates commercial applications or that of our psearch benchmark. In this case, hierarchical sampling performs at least as well as function sampling. The benefits of hierarchical sampling are also orthogonal to the choice of data race detection algorithm: lock-set-based or happens-before.

In PGAS languages, global addresses have associated object metadata and casting and address arithmetic operations require calls into the language runtime. Thus, a runtime only implementation of the aliasing heuristic works well. Static analysis may be able to complement or supplant our runtime analysis approach. Besides the obvious engineering challenges to implement alias and escape analysis in a multilanguage, multi-programming model environment, there is a practical expectation on performance improvements. The implementation of our alias heuristic adds one extra branch instruction per memory reference. Static analysis will eliminate some of these branches. Qualitatively, this is similar to the effects of function sampling when compared to instruction sampling. In our experiments, function sampling outperformed instruction sampling by at most  $3\times$ .

As our hierarchical approach considers only two granularities at function and instruction level, an interesting question is whether further refinement improves performance. One obvious solution is to consider loop nests as the unit of instrumentation. Our tool did not finish when instrumenting only the first execution of any function, which indicates that any strategy that instruments whole loops is likely to exhibit high overhead. Thus, a strategy that peels off few instrumented loop iterations is required to limit slowdown. In our opinion, any such strategy is likely to produce equivalent behavior to the hierarchical sampling we implemented.

Performance can be further improved by reducing computation overhead with data structures with better scalability characteristics than Interval Skiplist. The results provide little incentive as for our workload instrumentation overhead directly determines slowdown.

For our future work we plan to extend the data race detection implementation to provide maximum coverage on a time budget: our goal is to find the maximum number of data races with no more than a guaranteed application slowdown. Our preliminary experiences indicate that we are likely to be able to guarantee no more than  $2 \times$  slowdown. Toward controlling time, the scalability analysis of the internal data structures has already vielded valuable insights which allows us to derive space/state bounds. Toward improving coverage, we plan to use the concept of region stacks introduced in the formalism presented in Section 4.2. We plan to experiment with several other strategies besides exponential back-off at reference and function level: i) proportional sampling per unique region stack; ii) k-region context sampling (similar to k-CFA [34] and k-object-sensitive [24] analyses; and iii) proportional sampling at functions and exponential back-off at statements.



Figure 4: Scalability of the different sampling methods when running the tool on the NAS Parallel Benchmarks, classes C and D. The overhead of instruction sampling  $\mathbf{I}$  is very high compared to the others and it has been omitted for presentation purposes.

# OTHER RELATED WORK

Data race detection tools can be broadly classified as using static or dynamic techniques. Static techniques [30, 9] are scalable and complete, i.e. find all the races in the program. On the other hand, they report a very large number of false positives which need to be filtered by users and can handle only limited types of synchronization primitives such as locks or barriers.

Dynamic techniques for finding concurrency bugs can be classified into two classes: predictive techniques and precise techniques Predictive dynamic techniques [33, 11] could predict concurrency bugs that did not happen in a concurrent execution; however, such techniques still report false warnings. UPC-Thrille implements a predictive dynamic technique, followed by automatic filtering of false positives. Precise dynamic techniques, such as happens-before race detection [2, 12, 14] and atomicity monitoring [22, 15], are capable of detecting concurrency bugs that actually happen in an execution. Therefore, these techniques are precise, but they cannot give good coverage as predictive dynamic techniques.

Dynamic techniques have to address the challenge of high runtime overhead. Sampling approaches to reduce instrumentation overhead have been discussed throughout this paper. Techniques to reduce the computation overhead have been explored as well. Choi et al [11] discuss static analysis techniques to reduce the overhead of data race detection for Java programs. As alias and pointer analysis for C based programs is notoriously conservative, these techniques need to be supplemented by the runtime techniques presented in Section 4.1. Recently, Raman et al [31] describe a scalable implementation for data race detection in Habanero Java programs implemented using fine-grained structured parallelism. Their benchmarks are equivalent to our fine-grained benchmarks, while our NAS benchmarks use coarse grained interactions. They report analysis overheads as high as  $10 \times$ and provide valuable data about the scalability of other state of the art race detectors for multi-threaded programs: Eraser [33] and FastTrack [14]. They report slowdowns as high as  $100 \times$  for the latter.

So far there have been a lot of research effort to verify and test concurrent and parallel programs written in Java and C/pthreads for non-HPC platforms; the literature listed above supports this fact. There have also been efforts to test and verify HPC programs, mostly focused on C/MPI programs. ISP [37] is a push-button dynamic verifier capable of detecting deadlocks, resource leaks, and assertion violations in C/MPI programs. DAMPI [39, 40] overcomes ISP's scalability limitations and scales to thousands of MPI processes. Like ISP, DAMPI only tests for MPI Send/Recy interleavings, but runs in a distributed way. In contrast to our work, DAMPI instruments and reasons only about the ordering of Send/Recv operations with respect to the MPI ranks, and not about the memory accessed by these operations. Both ISP and DAMPI assume that program input is fixed. TASS [35] removes this limitation by using symbolic execution to reason about all possible inputs to a MPI program, but it is work only at inception. MPI messages can be intercepted and analyzed for bugs and anomalies. Intel MessageChecker [13] does a post-mortem analysis after collecting message traces, while MARMOT [20] and Umpire [38] check at runtime.

# **CONCLUSION**

To our knowledge, we discuss the first implementation of a data race detector for distributed memory programs that tracks both memory references and communication operations. The main design goal of our implementation is to provide low overhead with good program coverage when running at scale.

Dynamic program analysis tools face the challenge of instrumentation overhead and sampling techniques have been shown to be effective in reducing program slowdown. The state-of-the-art technique to reduce overhead is considered to be function sampling. We use a workload containing UPC programs and experiment with function and instruction level sampling. Our results indicate that function level sampling is not feasible for scientific programs: increasing the input set increases the amount of work per function invocation in these applications and the analysis does not terminate. Instruction sampling works better for scientific programs and our implementation finds races with up to  $65 \times$  slowdown.

In order to obtain acceptable program slowdown we experiment with tuning the sampling frequency. For the workload considered, the performance degradation for settings required to find all the "known" bugs is still high.

We propose two techniques to improve the scalability of data race detection: 1) hierarchical function and instruction level sampling; and 2) exploiting the runtime persistence of aliasing and locality in UPC applications. The results indicate that both techniques are required in practice: any stand-alone technique is able to reduce overhead only to 1000% (10× slowdown). When applying the optimizations in conjunction our tool finds races with at most 50% overhead when running on 2048 cores of a CrayXE6 system.

# **Acknowledgements**

Support for this work was provided through the X-Stack program funded by U.S. Department of Energy, Office of Science, Advanced Scientific Computing Research under collaborative agreement number DE-SC0008699 and through a gift from Oracle. Additional support to LBNL was provided by the U.S. Department of Defense. This research also supported in part by Microsoft (Award #024263) and Intel (Award #024894) funding and by matching funding by U.C. Discovery (Award #DIG07-10227).

- **9. REFERENCES**[1] S. Adve and K. Gharachorloo. Shared Memory Consistency Models: A Tutorial. IEEE computing, December 1996.
- S. V. Adve, M. D. Hill, B. P. Miller, and R. H. B. Netzer. Detecting Data Races on Weak Memory Systems. In 18th International Symposium on Computer architecture (ISCA), pages 234–243. ACM,
- [3] M. Arnold and B. G. Ryder. A Framework for Reducing the Cost of Instrumented Code. In Proceedings of the ACM SIGPLAN 2001 conference on Programming language design and implementation, PLDI '01, 2001.
- [4] M. Arnold, M. T. Vechev, and E. Yahav. QVM: An Efficient Runtime for Detecting Defects in Deployed Systems. ACM Trans. Softw. Eng. Methodol., 21(1),
- K. Asanovic, R. Bodik, B. C. Catanzaro, J. J. Gebis, P. Husbands, K. Keutzer, D. A. Patterson, W. L. Plishker, J. Shalf, S. W. Williams, and K. A. Yelick.

- The Landscape of Parallel Computing Research: A View from Berkeley. Technical Report UCB/EECS-2006-183, EECS Department, University of California, Berkeley, Dec 2006.
- [6] D. Bailey, T. Harris, W. Saphir, R. Van Der Wijngaart, A. Woo, and M. Yarrow. The NAS Parallel Benchmarks 2.0. Technical Report NAS-95-010, NASA Ames Research Center, 1995.
- [7] S. Bensalem and K. Havelund. Dynamic Deadlock Analysis of Multi-threaded Programs. In *Haifa Verification Conference*, pages 208–223, 2005.
- [8] D. Bonachea. Proposal for Extending the UPC Memory Copy Library Functions and Supporting Extensions to GASNet. Technical Report LBNL-56495, Lawrence Berkeley National Lab, 2004.
- [9] S. Burckhardt, R. Alur, and M. M. K. Martin. CheckFence: Checking Consistency of Concurrent Data Types on Relaxed Memory Models. In Programming Language Design and Implementation (PLDI), pages 12–21, 2007.
- [10] W. W. Carlson, J. M. Draper, D. E. Culler, K. Yelick, and K. W. E. Brooks. Introduction to UPC and Language Specification, 1999.
- [11] J.-D. Choi, K. Lee, A. Loginov, R. O'Callahan, V. Sarkar, and M. Sridharan. Efficient and precise datarace detection for multithreaded object-oriented programs. In *Programming language design and implementation (PLDI)*, pages 258–269, New York, NY, USA, 2002. ACM.
- [12] J.-D. Choi, B. P. Miller, and R. H. B. Netzer. Techniques for Debugging Parallel Programs With Flowback Analysis. ACM Trans. Program. Lang. Syst., 13(4):491–530, 1991.
- [13] J. DeSouza, B. Kuhn, B. R. de Supinski, V. Samofalov, S. Zheltov, and S. Bratanov. Automated, scalable debugging of MPI programs with Intel Message Checker. In Software engineering for high performance computing system applications, SE-HPCS '05, pages 78–82, 2005.
- [14] C. Flanagan and S. N. Freund. FastTrack: efficient and precise dynamic race detection. In *Programming* language design and implementation (PLDI), 2009.
- [15] C. Flanagan, S. N. Freund, and J. Yi. Velodrome: a sound and complete dynamic atomicity checker for multithreaded programs. In *Programming language* design and implementation (PLDI), pages 293–303. ACM, 2008.
- [16] E. N. Hanson and T. Johnson. The interval skip list: A data structure for finding all intervals that overlap a point. In Workshop on Algorithms and Data Structures, pages 153–164. Springer, 1992.
- [17] M. Hauswirth and T. M. Chilimbi. Low-overhead memory leak detection using adaptive statistical profiling. In Proceedings of the 11th international conference on Architectural support for programming languages and operating systems, ASPLOS-XI, 2004.
- [18] P. Husbands, C. Iancu, and K. Yelick. A performance analysis of the berkeley upc compiler. In *Proceedings* of the 17th annual international conference on Supercomputing, ICS '03, 2003.
- [19] P. Joshi, M. Naik, C.-S. Park, and K. Sen. An extensible active testing framework for concurrent programs. In *Computer Aided Verification (CAV)*, Lecture Notes in Computer Science. Springer, 2009.
- [20] B. Krammer, M. Müller, and M. Resch. Runtime checking of MPI applications with MARMOT. In Mini-Symposium Tools Support for Parallel Programming, ParCo 2005, Malaga, Spain.
- [21] W. Kuchera and C. Wallace. The UPC memory model: Problems and prospects. In the 18th International Parallel and Distributed Processing Symposium (IPDPS), April 2004.

- [22] S. Lu, J. Tucek, F. Qin, and Y. Zhou. AVIO: detecting atomicity violations via access interleaving invariants. SIGARCH Comput. Archit. News, 34(5):37–48, 2006.
- [23] D. Marino, M. Musuvathi, and S. Narayanasamy. LiteRace: Effective Sampling for Lightweight Data-Race Detection. In *PLDI*, 2009.
- [24] A. Milanova. Parameterized object sensitivity for points-to analysis for java. ACM Trans. Softw. Eng. Methodol, 14:2005, 2005.
- [25] The NAS Parallel Benchmarks. Available at http://www.nas.nasa.gov/Software/NPB.
- [26] The UPC NAS Parallel Benchmarks. Available at http://upc.gwu.edu/download.html.
- [27] R. H. B. Netzer and B. P. Miller. What are race conditions? some issues and formalizations. *LOPLAS*, 1992.
- [28] S. Olivier and J. Prins. Scalable dynamic load balancing using UPC. In *International Conference on Parallel Processing (ICPP)*, 2008.
- [29] C.-S. Park, K. Sen, P. Hargrove, and C. Iancu. Efficient Data Race Detection for Distributed Memory Parallel Programs. In *Proceedings of the* Supercomputing Conference (SC11), 2011.
- [30] S. Qadeer and D. Wu. KISS: keep it simple and sequential. In *Programming language design and* implementation (PLDI), pages 14–24. ACM, 2004.
- [31] R. Raman, J. Zhao, V. Sarkar, M. Vechev, and E. Yahav. Scalable and Precise Dynamic Datarace Detection for Structured Parallelism. In Proceedings of the 33rd ACM SIGPLAN conference on Programming Language Design and Implementation (PLDI), 2012.
- [32] P. Sack, B. E. Bliss, Z. Ma, P. Petersen, and J. Torrellas. Accurate and efficient filtering for the intel thread checker race detector. In *Proceedings of* the 1st workshop on Architectural and system support for improving software dependability, ASID '06, 2006.
- [33] S. Savage, M. Burrows, G. Nelson, P. Sobalvarro, and T. Anderson. Eraser: a dynamic data race detector for multithreaded programs. ACM Trans. Comput. Syst., 15(4):391–411, 1997.
- [34] O. Shivers. Control-Flow Analysis of Higher-Order Languages. PhD thesis, 1991.
- [35] S. F. Siegel and T. K. Zirkel. Automatic formal verification of MPI-based parallel programs. In Principles and practice of parallel programming, PPoPP '11, 2011.
- [36] C. Terboven. Comparing intel thread checker and sun thread analyzer. In PARCO'07, pages 669–676, 2007.
- [37] S. S. Vakkalanka, S. Sharma, G. Gopalakrishnan, and R. M. Kirby. ISP: a tool for model checking MPI programs. In *Principles and practice of parallel* programming, PPoPP '08, 2008.
- [38] J. S. Vetter and B. R. de Supinski. Dynamic software testing of MPI applications with Umpire. In Supercomputing, SC '00, Washington, DC, USA, 2000. IEEE Computer Society.
- [39] A. Vo, S. Aananthakrishnan, G. Gopalakrishnan, B. R. d. Supinski, M. Schulz, and G. Bronevetsky. A Scalable and Distributed Dynamic Formal Verifier for MPI Programs. In *Supercomputing (SC10)*, 2010.
- [40] A. Vo, G. Gopalakrishnan, R. M. Kirby, B. R. de Supinski, M. Schulz, and G. Bronevetsky. Large scale verification of mpi programs using lamport clocks with lazy update. In *Proceedings of the 2011 International Conference on Parallel Architectures and Compilation Techniques*, PACT '11, pages 330–339, Washington, DC, USA, 2011. IEEE Computer Society.
- [41] K. Yelick, D. Bonachea, and C. Wallace. A Proposal for a UPC Memory Consistency Model. Technical Report LBNL-54983, Lawrence Berkeley National Laboratory, May 2004.