Costin Iancu

Costin Iancu

Senior Staff Scientist

Computer Science Department

Phone: +1 510 495 2122

Fax: +1 510 486 6900

I am performing research in the areas of programming models and code optimization for large-scale parallel systems. I tend to favor simple and practical designs. Over the years I've been involved in multiple projects.

Big Data Support on HPC Systems. (Intel Parallel Computing Center): There exists an incentive in the HPC community to run commercial data analytics frameworks (Hadoop, Spark) on capability systems. Adoption so far has been hampered by design decisions that are commercial data-center specific, e.g. local disk available. We plan to explore R&D for technologies to enable Spark to run on large-scale HPC systems. The IPCC is part of our larger research agenda and focuses on improving the interaction between Spark and the Lustre parallel global file system on production systems at NERSC. First notable accomplishment is that we show how to scale Spark from 100 cores to 10.000 cores on production HPC systems.
CORVETTE (Correctness, Verification and Testing of Parallel Programs): I'm interested in tools that assist developers in bug finding or program state exploration for debugging or optimization purposes. The focus is on developing a scalable dynamic program analysis framework for the HPC programming models
Berkeley UPC I wrote a lot of the compiler code and moved on to code optimizations, where I've been successful when using a combination of static and dynamic program analyses to guide communication and task scheduling optimizations. To guide optimizations, I'd rather use "qualitative" performance models over the more traditional approaches: track derivatives and preserve ordering rather than predict absolute values.
THOR (Throughput Oriented Runtime): In this project, we are trying to build a software overlay network to perform on-the-fly communication optimizations for large-scale systems. There are three dimensions of control: concurrency, order, and granularity. We have developed mechanisms to regulate the concurrency of communication operations, avoid congestion, scheduling/reordering for improved throughput, etc.

The individual project pages contain more information: slides, posters, software, fun ...

Journal Articles

Nicholas Chaimov, Khaled Z. Ibrahim, Samuel Williams, Costin Iancu, "Reaching Bandwidth Saturation Using Transparent Injection Parallelization", International Journal of High Performance Computing Applications (IJHPCA), November 2016, doi: 10.1177/1094342016672720

Nicholas Chaimov, Khaled Ibrahim, Samuel Williams, Costin Iancu, "Exploiting Communication Concurrency on High Performance Computing Systems", IJHPCA, April 17, 2015,

Download File: thorserv2.pdf (pdf: 1.7 MB)

Khaled Z. Ibrahim, Steven Hofmeyr, Costin Iancu, "The Case for Partitioning Virtual Machines on Manycore Architectures", IEEE TPDS, April 17, 2014,

Download File: TPDSsubmit.pdf (pdf: 5.6 MB)

S Hofmeyr, J Colmenares, J Kubiatowicz, C Iancu, "Juggle: Addressing Extrinsic Load Imbalances in SPMD Applications on Multicore Computer", Cluster Computing, 2012,

Conference Papers

Mathias Weiden, Justin Kalloor, John Kubiatowicz, Ed Younis, Costin Iancu, "Wide Quantum Circuit Optimization with Topology Aware Synthesis", Third International Workshop on Quantum Computing Software, November 13, 2022,

Unitary synthesis is an optimization technique that can achieve optimal gate counts while mapping quantum circuits to restrictive qubit topologies. Synthesis algorithms are limited in scalability by their exponentially growing run times. Application to wide circuits requires partitioning into smaller components. In this work, we explore methods to reduce depth and multi-qubit gate count of wide, mapped quantum circuits using synthesis. We present TopAS, a topology aware synthesis tool that preconditions quantum circuits before mapping. Partitioned subcircuits are optimized and fitted to sparse subtopologies to balance the opposing demands of synthesis and mapping algorithms. Compared to state of the art wide circuit synthesis algorithms, TopAS is able to reduce depth on average by 35.2% and CNOT count by 11.5% for mesh topologies. Compared to the optimization and mapping algorithms of Qiskit and Tket, TopAS is able to reduce CNOT counts by 30.3% and depth by 38.2% on average.

Anastasiia Butko, George Michelogiannakis, Samuel Williams, Costin Iancu, David Donofrio, John Shalf, Jonathan Carter, Irfan Siddiqi, "Understanding Quantum Control Processor Capabilities and Limitations through Circuit Characterization", IEEE Conference on Rebooting Computing (ICRC), December 2020,

Download File: ICRC20-QUASAR-final.pdf (pdf: 1.1 MB)

Ethan H. Smith, Marc G. Davis, Jeffery M. Larson, Costin Iancu, "LEAP: Scaling Numerical Optimization Based Synthesis Using an Incremental Approach", International Workshop of Quantum Computing Software at Supercomputing, November 2020,

Download File: LEAP-Heuristics-for-Near-Optimal-Synthesis-Targeting-NISQ-Devices.pdf (pdf: 125 KB)

Marc G. Davis, Ethan Smith, Ana Tudor, Koushik Sen, Irfan Siddiqi, Costin Iancu, "Towards Optimal Topology Aware Quantum Circuit Synthesis", 2020 IEEE International Conference on Quantum Computing and Engineering (QCE), Denver, CO, USA, IEEE, October 12, 2020, doi: 10.1109/QCE49297.2020.00036

We present an algorithm for compiling arbitrary unitaries into a sequence of gates native to a quantum processor. As CNOT gates are error-prone for the foreseeable Noisy-Intermediate-Scale Quantum devices era, our A* inspired algorithm minimizes their count while accounting for connectivity. We discuss the formulation of synthesis as a search problem as well as an algorithm to find solutions. For a workload of circuits with complexity appropriate for the NISQ era, we produce solutions well within the best upper bounds published in literature and match or exceed hand tuned implementations, as well as other existing synthesis alternatives. In particular, when comparing against state-of-the-art available synthesis packages we show 2.4× average (up to 5.3×) reduction in CNOT count. We also show how to re-target the algorithm for a different chip topology and native gate set while obtaining similar quality results. We believe that tools like ours can facilitate algorithmic exploration and guide gate set discovery for quantum processor designers, as well as being useful for optimization in the quantum compilation tool-chain.

Marc Grau Davis, Ethan Smith, Ana Tudor, Koushik Sen, Irfan Siddiqi, Costin Iancu, "Heuristics for Quantum Compiling with a Continuous Gate Set", 3rd International Workshop on Quantum Compilation as part of the International Conference On Computer Aided Design 2019, December 5, 2019,

We present an algorithm for compiling arbitrary unitaries into a sequence of gates native to a quantum processor. As accurate CNOT gates are hard for the foreseeable Noisy- Intermediate-Scale Quantum devices era, our A* inspired algorithm attempts to minimize their count, while accounting for connectivity. We discuss the search strategy together with metrics to expand the solution frontier. For a workload of circuits with complexity appropriate for the NISQ era, we produce solutions well within the best upper bounds published in literature and match or exceed hand tuned implementations, as well as other existing synthesis alternatives. In particular, when comparing against state-of-the-art available synthesis packages we show 2.4x average (up to 5.3x) reduction in CNOT count. We also show how to re-target the algorithm for a different chip topology and native gate set, while obtaining similar quality results. We believe that empirical tools like ours can facilitate algorithmic exploration, gate set discovery for quantum processor designers, as well as providing useful optimization blocks within the quantum compilation tool-chain.

Cuong Nguyen, Cindy Rubio-Gonzalez, Benjamin Mehne, Koushik Sen, Costin Iancu, James Demmel, William Kahan, Wim Lavrijsen, David H. Bailey, David Hough, "Floating-point precision tuning using blame analysis", 38th International Conference on Software Engineering (ICSE 2016), May 14, 2016, doi: 10.1145/2884781.2884850

Download File: icse16-blame-analysis.pdf (pdf: 1 MB)

While tremendously useful, automated techniques for tuning the precision of floating-point programs face important scalability challenges. We present Blame Analysis, a novel dynamic approach that speeds up precision tuning. Blame Analysis performs floating-point instructions using different levels of accuracy for their operands. The analysis determines the precision of all operands such that a given precision is achieved in the final result of the program. Our evaluation on ten scientific programs shows that Blame Analysis is successful in lowering operand precision. As it executes the program only once, the analysis is particularly useful when targeting reductions in execution time. In such case, the analysis needs to be combined with search-based tools such as Precimonious. Our experiments show that combining Blame Analysis with Precimonious leads to obtaining better results with significant reduction in analysis time: the optimized programs execute faster (in three cases, we observe as high as 39.9% program speedup) and the combined analysis time is 9× faster on average, and up to 38× faster than Precimonious alone.

Nicholas Chaimov, Allen Malony, Shane Canon, Costin Iancu, Khaled Ibrahim, Jay Srinivasan, "Scaling Spark on HPC Systems", High Performance and Distributed Computing (HPDC), February 5, 2016,

Download File: spark-on-cray.pdf (pdf: 2.6 MB)

Xuehai Qian, Koushik Sen, Paul Hargrove, Costin Iancu, "SReplay: Deterministic Sub-Group Replay for One-Sided Communication", International Conference on Supercomputing (ICS), 2016, February 5, 2016,

Download File: iancu-ics16.pdf (pdf: 411 KB)

Wim Lavrijsen, Costin Iancu, Wibe Albert de Jong, Xin Chen, Karsten Schwan, "Exploiting Variability for Energy Optimization of Load Balanced Parallel Programs", EuroSys 2016, February 5, 2016,

Download File: iancu-eurosys16.pdf (pdf: 2.6 MB)

Costin Iancu, Nicholas Chaimov, Khaled Z. Ibrahim, Samuel Williams, "Exploiting Communication Concurrency on High Performance Computing Systems", Programming Models and Applications for Multicores and Manycores (PMAM), February 2015,

Download File: pmam15-servers.pdf (pdf: 1.2 MB)

Milind Chabbi, Wim Lavrijsen, Wibe de Jong, Koushik Sen, John Mellor Crummey, Costin Iancu, "Barrier Elision for Production Parallel Programs", PPOPP 2015, February 5, 2015,

Download File: nwbar.pdf (pdf: 663 KB)

M Chabbi, W Lavrijsen, W De Jong, K Sen, J Mellor-Crummey, C Iancu, "Barrier elision for production parallel programs", Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP, January 1, 2015, 2015-Jan:109--119, doi: 10.1145/2688500.2688502

Khaled Ibrahim, Paul Hargrove, Costin Iancu, Katherine Yelick, "A Performance Evaluation of One-Sided and Two-Sided Communication Paradigms on Relaxed-Ordering Interconnect", IPDPS 2014, April 17, 2014,

Cindy Rubio-Gonzalez, Cuong Nguyen, Hong Diep Nguyen, James Demmel, William Kahan, Koushik Sen, David H. Bailey, Costin Iancu, David Hough, "Precimonious: Tuning Assistant for Floating-Point Precision", Supercomputing 2013, November 17, 2013, doi: 10.1145/2503210.2503296

Download File: sc13.pdf (pdf: 555 KB)

Given the variety of numerical errors that can occur, floating-point programs are difficult to write, test and debug. One common practice employed by developers without an advanced background in numerical analysis is using the highest available precision. While more robust, this can degrade program performance significantly. In this paper we present Precimonious, a dynamic program analysis tool to assist developers in tuning the precision of floating-point programs. Precimonious performs a search on the types of the floating-point program variables trying to lower their precision subject to accuracy constraints and performance goals. Our tool recommends a type instantiation that uses lower precision while producing an accurate enough answer without causing exceptions. We evaluate Precimonious on several widely used functions from the GNU Scientific Library, two NAS Parallel Benchmarks, and three other numerical programs. For most of the programs analyzed, Precimonious reduces precision, which results in performance improvements as high as 41%.

Chang-Seo Park, Koushik Sen, Costin Iancu, "Scaling Data Race Detection for Partitioned Global Address Space Programs", International Supercomputing Conference (ICS) 2013, 2013, doi: 10.1145/2464996.2465000

Download File: thrille-exp4.pdf (pdf: 744 KB)

Contemporary and future programming languages for HPC promote hybrid parallelism and shared memory abstractions using a global address space. In this programming style, data races occur easily and are notoriously hard to find. Existing state-of-the-art data race detectors exhibit 10X-100X performance degradation and do not handle hybrid parallelism. In this paper we present the first complete implementation of data race detection at scale for UPC programs. Our implementation tracks local and global memory references in the program and it uses two techniques to reduce the overhead: 1) hierarchical function and instruction level sampling; and 2) exploiting the runtime persistence of aliasing and locality specific to Partitioned Global Address Space applications. The results indicate that both techniques are required in practice: well optimized instruction sampling introduces overheads as high as 6500% (65X slowdown), while each technique in separation is able to reduce it only to 1000% (10X slowdown). When applying the optimizations in conjunction our tool finds all previously known data races in our benchmark programs with at most 50% overhead when running on 2048 cores. Furthermore, while previous results illustrate the benefits of function level sampling, our experiences show that this technique does not work for scientific programs: instruction sampling or a hybrid approach is required.

Miao Luo, Dhabaleswar K. Panda, Khaled Z. Ibrahim, Costin Iancu, "Congestion avoidance on manycore high performance computing systems", International Conference on Supercomputing (ICS), 2012,

Download File: ics105-luo.pdf (pdf: 1.8 MB)

Seung-Jai Min, Costin Iancu and Katherine Yelick, "Hierarchical Work Stealing on Manycore Clusters", Fifth Conference on Partitioned Global Address Space Programming Models (PGAS11), 2011,

Download File: task.pdf (pdf: 703 KB)

Chang-Seo Park, Koushik Sen, Paul Hargrove, Costin Iancu, "Efficient data race detection for distributed memory parallel programs", Supercomputing (SC), 2011,

Download File: upcthrille.pdf (pdf: 457 KB)

KZ Ibrahim, S Hofmeyr, C Iancu, "Characterizing the performance of parallel applications on multi-socket virtual machines", Proceedings - 11th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGrid 2011, 2011, 1--12, doi: 10.1109/CCGrid.2011.50

Download File: ccgrid11submit.pdf (pdf: 427 KB)

S Hofmeyr, JA Colmenares, C Iancu, J Kubiatowicz, "Juggle: Proactive load balancing on multicore computers", Proceedings of the IEEE International Symposium on High Performance Distributed Computing, 2011, 3--14, doi: 10.1145/1996130.1996134

KZ Ibrahim, S Hofmeyr, C Iancu, E Roman, "Optimized pre-copy live migration for memory intensive applications", Proceedings of 2011 SC - International Conference for High Performance Computing, Networking, Storage and Analysis, 2011, doi: 10.1145/2063384.2063437

Download File: VMMigrationsubmit.pdf (pdf: 791 KB)

Filip Blagojevic, Paul Hargrove, Costin Iancu, and Katherine Yelick,, "Hybrid PGAS runtime support for multicore nodes", Fourth Conference on Partitioned Global Address Space Programming Model (PGAS), October 2010, doi: 10.1145/2020373.2020376

Download File: paper2.pdf (pdf: 1 MB)

With multicore processors as the standard building block for high performance systems, parallel runtime systems need to provide excellent performance on shared memory, distributed memory, and hybrids. Conventional wisdom suggests that threads should be used as the runtime mechanism within shared memory, and two runtime versions for shared and distributed memory are often designed and implemented separately, retrofitting after the fact for hybrid systems. In this paper we consider the problem of implementing a runtime layer for Partitioned Global Address Space (PGAS) languages, which offer a uniform programming abstraction for hybrid machines. We present a new process-based shared memory runtime and compare it to our previous pthreads implementation. Both are integrated with the GASNet communication layer, and they can co-exist with one another. We evaluate the shared memory runtime approaches, showing that they interact in important and sometimes surprising ways with the communication layer. Using a set of microbenchmarks and application level benchmarks on an IBM BG/P, Cray XT, and InfiniBand cluster, we show that threads, processes and combinations of both are needed for maximum performance. Our new runtime shows speedups of over 60% for application benchmarks and 100% for collective communication benchmarks, when compared to the previous implementation. Our work primarily targets PGAS languages, but some of the lessons are relevant to other parallel runtime systems and libraries.

Costin Iancu, Steven Hofmeyr, Filip Blagojević, Yili Zheng, "Oversubscription on multicore processors", Proceedings of the 2010 IEEE International Symposium on Parallel and Distributed Processing (IPDPS), April 2010, doi: 10.1109/IPDPS.2010.5470434

Download File: ovsub.pdf (pdf: 449 KB)

Existing multicore systems already provide deep levels of thread parallelism; hybrid programming models and composability of parallel libraries are very active areas of research within the scientific programming community. As more applications and libraries become parallel, scenarios where multiple threads compete for a core are unavoidable. In this paper we evaluate the impact of task oversubscription on the performance of MPI, OpenMP and UPC implementations of the NAS Parallel Benchmarks on UMA and NUMA multi-socket architectures. We evaluate explicit thread affinity management against the default Linux load balancing and discuss sharing and partitioning system management techniques. Our results indicate that oversubscription provides beneficial effects for applications running in competitive environments. Sharing all the available cores between applications provides better throughput than explicit partitioning. Modest levels of oversubscription improve system throughput by 27% and provide better performance isolation of applications from their co-runners: best overall throughput is always observed when applications share cores and each is executed with multiple threads per core. Rather than “resource” symbiosis, our results indicate that the determining behavioral factor when applications share a system is the granularity of the synchronization operations.

S Hofmeyr, C Iancu, F Blagojević, "Load balancing on speed", Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP, January 1, 2010, 147--157, doi: 10.1145/1693453.1693475

Download File: ppopp141-hofmeyr.pdf (pdf: 1.1 MB)

Filip Blagojevic, Costin Iancu, Katherine A. Yelick, Matthew Curtis-Maury, Dimitrios S. Nikolopoulos, Benjamin Rose, "Scheduling dynamic parallelism on accelerators.", Computing Frontiers (CF), June 6, 2009,

C Iancu, S Hofmeyr, "Runtime optimization of vector operations on large scale SMP clusters", Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT, 2008, 122--132, doi: 10.1145/1454115.1454134

Download File: pact120-iancu.pdf (pdf: 1.2 MB)

Katherine Yelick, Dan Bonachea, Wei-Yu Chen, Phillip Colella, Kaushik Datta, Jason Duell, Susan L. Graham, Paul Hargrove, Paul Hilfinger, Parry Husbands, Costin Iancu, Amir Kamil, Rajesh Nishtala, Jimmy Su, Michael Welcome, Tong Wen, "Productivity and Performance Using Partitioned Global Address Space Languages", Proceedings of the 2007 International Workshop on Parallel Symbolic Computation (PASCO), July 2007, 24--32, doi: 10.1145/1278177.1278183

Partitioned Global Address Space (PGAS) languages combine the programming convenience of shared memory with the locality and performance control of message passing. One such language, Unified Parallel C (UPC) is an extension of ISO C defined by a consortium that boasts multiple proprietary and open source compilers. Another PGAS language, Titanium, is a dialect of Java T M designed for high performance scientific computation. In this paper we describe some of the highlights of two related projects, the Titanium project centered at U.C. Berkeley and the UPC project centered at Lawrence Berkeley National Laboratory. Both compilers use a source-to-source strategy that translates the parallel languages to C with calls to a communication layer called GASNet. The result is portable high-performance compilers that run on a large variety of shared and distributed memory multiprocessors. Both projects combine compiler, runtime, and application efforts to demonstrate some of the performance and productivity advantages to these languages.

Wei-Yu Chen, Dan Bonachea, Costin Iancu, Katherine A. Yelick, "Automatic nonblocking communication for partitioned global address space programs", Proceedings of the International Conference on Supercomputing (ICS), June 17, 2007, 158--167, doi: 10.1145/1274971.1274995

Overlapping communication with computation is an important optimization on current cluster architectures; its importance is likely to increase as the doubling of processing power far outpaces any improvements in communication latency. PGAS languages offer unique opportunities for communication overlap, because their one-sided communication model enables low overhead data transfer. Recent results have shown the value of hiding latency by manually applying language-level nonblocking data transfer routines, but this process can be both tedious and error-prone. In this paper, we present a runtime framework that automatically schedules the data transfers to achieve overlap. The optimization framework is entirely transparent to the user, and aggressively reorders and aggregates both remote puts and gets. We preserve correctness via runtime conflict checks and temporary buffers, using several techniques to lower the overhead. Experimental results on application benchmarks suggest that our framework can be very effective at hiding communication latency on clusters, improving performance over the blocking code by an average of 16% for some of the NAS Parallel Benchmarks, 48% for GUPS, and over 25% for a multi-block fluid dynamics solver. While the system is not yet as effective as aggressive manual optimization, it increases programmers' productivity by freeing them from the details of communication management.

Costin Iancu, Erich Strohmaier, "Optimizing communication overlap for high-speed networks", Principles and Practice of Parallel Programming (PPoPP), 2007,

Download File: paper58-iancu.pdf (pdf: 469 KB)

C Iancu, W Chen, K Yelick, "Performance portable optimizations for loops containing communication operations", Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT, January 1, 2007, 411, doi: 10.1109/PACT.2007.4336239

Download File: portableperformance2.pdf (pdf: 333 KB)

L. Oliker, A. Canning, J. Carter, C. Iancu, M. Lijewski, S. Kamil, J. Shalf, H. Shan, E. Strohmaier, S. Ethier, T. Goodale, "Scientific application performance on candidate petascale platforms", Proceedings - 21st International Parallel and Distributed Processing Symposium, IPDPS 2007; Abstracts and CD-ROM, 2007, doi: 10.1109/IPDPS.2007.370259

Download File: ipdps07-petascale.pdf (pdf: 4.4 MB)

Costin Iancu, Parry Husbands, Paul Hargrove, "HUNTing the Overlap", IEEE Parallel Architectures and Compilation Techniques (PACT), September 2005, doi: 10.1109/PACT.2005.25

Hiding communication latency is an important optimization for parallel programs. Programmers or compilers achieve this by using non-blocking communication primitives and overlapping communication with computation or other communication operations. Using non-blocking communication raises two issues: performance and programmability. In terms of performance, optimizers need to find a good communication schedule and are sometimes constrained by lack of full application knowledge. In terms of programmability, efficiently managing non-blocking communication can prove cumbersome for complex applications. In this paper we present the design principles of HUNT, a runtime system designed to search and exploit some of the available overlap present at execution time in UPC programs. Using virtual memory support, our runtime implements demand-driven synchronization for data involved in communication operations. It also employs message decomposition and scheduling heuristics to transparently improve the non-blocking behavior of applications. We provide a user level implementation of HUNT on a variety of modern high performance computing systems. Results indicate that our approach is successful in finding some of the overlap available at execution time. While system and application characteristics influence performance, perhaps the determining factor is the time taken by the CPU to execute a signal handler. Demand driven synchronization at execution time eliminates the need for the explicit management of non-blocking communication. Besides increasing programmer productivity, this feature also simplifies compiler analysis for communication optimizations.

WY Chen, C Iancu, K Yelick, "Communication optimizations for fine-grained UPC applications", Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT, 2005, 2005:267--272, doi: 10.1109/PACT.2005.13

Costin Iancu, Parry Husbands, Wei Chen, "Message Strip-Mining Heuristics for High Speed Networks", VECPAR, 2004, doi: 10.1007/11403937_33

In this work we investigate how the compiler technique of message strip-mining performs in practice on contemporary high performance networks. Message strip-mining attempts to reduce the overall cost of communication in parallel programs by breaking up large message transfers into smaller ones that can be overlapped with computation. In practice, however, network resource constraints may negate the expected performance gains. By deriving a performance model and synthetic benchmarks we determine how network and application characteristics influence the applicability of this optimization. We use these findings to determine heuristics to follow when performing this optimization on parallel programs. We propose strip-mining with variable block size as an alternative strategy that performs almost as well as a highly tuned fixed block strategy and has the advantage of being performance portable across systems and application input sets. We evaluate both techniques using synthetic benchmarks and an application from the NAS Parallel Benchmark suite.

Wei Chen, Dan Bonachea, Jason Duell, Parry Husbands, Costin Iancu, Katherine Yelick,, "A Performance Analysis of the Berkeley UPC Compiler", Proceedings of the International Conference on Supercomputing (ICS), ACM, June 23, 2003, 63--73, doi: 10.1145/782814.782825

Unified Parallel C (UPC) is a parallel language that uses a Single Program Multiple Data (SPMD) model of parallelism within a global address space. The global address space is used to simplify programming, especially on applications with irregular data structures that lead to fine-grained sharing between threads. Recent results have shown that the performance of UPC using a commercial compiler is comparable to that of MPI [7]. In this paper we describe a portable open source compiler for UPC. Our goal is to achieve a similar performance while enabling easy porting of the compiler and runtime, and also provide a framework that allows for extensive optimizations. We identify some of the challenges in compiling UPC and use a combination of micro-benchmarks and application kernels to show that our compiler has low overhead for basic operations on shared data and is competitive, and sometimes faster than, the commercial HP compiler. We also investigate several communication optimizations, and show significant benefits by hand-optimizing the generated code.

Christian Bell, Dan Bonachea, Yannick Cote, Jason Duell, Paul Hargrove, Parry Husbands, Costin Iancu, Michael L. Welcome, Katherine A. Yelick, "An Evaluation of Current High-Performance Networks", Proceedings of the International Parallel & Distributed Processing Symposium (IPDPS), April 22, 2003, doi: 10.1109/IPDPS.2003.1213106

High-end supercomputers are increasingly built out of commodity components, and lack tight integration between the processor and network. This often results in inefficiencies in the communication subsystem, such as high software overheads and/or message latencies. In this paper we use a set of microbenchmarks to quantify the cost of this commoditization, measuring software overhead, latency, and bandwidth on five contemporary supercomputing networks. We compare the performance of the ubiquitous MPI layer to that of lower-level communication layers, and quantify the advantages of the latter for small message performance. We also provide data on the potential for various communication-related optimizations, such as overlapping communication with computation or other communication. Finally, we determine the minimum size needed for a message to be considered 'large' (i.e., bandwidth-bound) on these platforms, and provide historical data on the software overheads of a number of supercomputers over the past decade.

Costin Iancu, Anurag Acharya, "An evaluation of search tree techniques in the presence of caches", International Symposium on Performance Analysis of Systems and Software, June 6, 2001,

Costin Iancu, Anurag Acharya, "A Comparison of Feedback Based and Fair Queuing Mechanisms for Handling Unresponsive Traffic", IEEE Symposium on Computers and Communications (ISCC), 2001,

Andrew Duncan, Bogdan Cocosel, Costin Iancu, Holger Kienle, Radu Rugina, Urs Holzle, "OSUIF: SUIF 2.0 With Objects", 2nd SUIF Compiler Workshop, June 6, 1997,

Book Chapters

L. Oliker, A. Canning, J. Carter, C. Iancu, M. Lijewski, S. Kamil, J. Shalf, H. Shan, E. Strohmaier, S. Ethier, T. Goodale, "Performance Characteristics of Potential Petascale Scientific Applications", Petascale Computing: Algorithms and Applications. Chapman & Hall/CRC Computational Science Series (Hardcover), edited by David A. Bader, ( 2007)

Chapter

Presentation/Talks

Yili Zheng, Filip Blagojevic, Dan Bonachea, Paul H. Hargrove, Steven Hofmeyr, Costin Iancu, Seung-Jai Min, Katherine Yelick, Getting Multicore Performance with UPC, SIAM Conference on Parallel Processing for Scientific Computing, February 2010,

Download File: Multicore-Performance-with-UPC-SIAMPP10-Zheng.pdf (pdf: 933 KB)

Yili Zheng, Costin Iancu, Paul H. Hargrove, Seung-Jai Min, Katherine Yelick, Extending Unified Parallel C for GPU Computing, SIAM Conference on Parallel Processing for Scientific Computing, February 24, 2010,

Reports

Xuehai Qian, Koushik Sen, Paul Hargrove, Costin Iancu, "OPR: Partial Deterministic Record and Replay for One-Sided Communication", LBNL TR, April 17, 2015,

Download File: main3.pdf (pdf: 295 KB)

UPC Consortium, "UPC Language and Library Specifications, Version 1.3", Lawrence Berkeley National Laboratory Technical Report, November 16, 2013, LBNL 6623E, doi: 10.2172/1134233

UPC is an explicitly parallel extension to the ISO C 99 Standard. UPC follows the partitioned global address space programming model. This document is the formal specification for the UPC language and library syntax and semantics, and supersedes prior specification version 1.2 (LBNL-59208).

UPC Consortium, "UPC Language Specifications, v1.2", Lawrence Berkeley National Laboratory Technical Report, May 31, 2005, LBNL 59208, doi: 10.2172/862127

Michael Schmitt. Anurag Acharya, Max Ibel, Costin Iancu, "Service Sockets: A Uniform User-Level Interface for Networking Applications", 2001,

Posters

Chang-Seo Park, Koushik Sen, Costin Iancu, "Scaling Data Race Detection for Partitioned Global Address Space Programs", Principles and Practice of Parallel Programming (PPoPP 2013), March 4, 2013,

Download File: thrille-exp5.pdf (pdf: 200 KB)

Costin Iancu, Wei Chen, Katherine A. Yelick, "Performance Portable Optimizations for Loops Containing Communication Operations", IEEE PACT, 2007,

Download File: portableperformance.pdf (pdf: 333 KB)

Christian Bell, Dan Bonachea, Wei Chen, Jason Duell, Paul Hargrove, Parry Husbands, Costin Iancu, Wei Tu, Mike Welcome, Kathy Yelick, "GASNet 2 - An Alternative High-Performance Communication Interface", ACM/IEEE Conference on Supercomputing (SC'04) Poster Session, November 2004,

Christian Bell, Dan Bonachea, Wei Chen, Jason Duell, Paul Hargrove, Parry Husbands, Costin Iancu, Mike Welcome, Kathy Yelick, "GASNet: Project Overview (SC'03)", ACM/IEEE Conference on Supercomputing (SC'03) Poster Session, November 2003,

Christian Bell, Dan Bonachea, Wei Chen, Jason Duell, Paul Hargrove, Parry Husbands, Costin Iancu, Mike Welcome, Kathy Yelick, "GASNet: Project Overview (SC'02)", ACM/IEEE Conference on Supercomputing (SC'02) Poster Session, November 2002,

Annual Reports

James Demmel, Costin Iancu, Koushik Sen, "Corvette Progress Report 2015", April 1, 2015,

Download File: corvette-progress.pdf (pdf: 198 KB)

Others

Ed Younis, Koushik Sen, Katherine Yelick, Costin Iancu, QFAST: Quantum Synthesis Using a Hierarchical Continuous Circuit Space, Bulletin of the American Physical Society, March 2021,

We present QFAST, a quantum synthesis tool designed to produce short circuits and to scale well in practice. Our contributions are: 1) a novel representation of circuits able to encode placement and topology; 2) a hierarchical approach with an iterative refinement formulation that combines "coarse-grained" fast optimization during circuit structure search with a good, but slower, optimization stage only in the final circuit instantiation. When compared against state-of-the-art techniques, although not always optimal, QFAST can reduce circuits for "time-dependent evolution" algorithms, as used by domain scientists, by 60x in depth. On typical circuits, it provides 4x better depth reduction than the widely used Qiskit and UniversalQ compilers. We also show the composability and tunability of our formulation in terms of circuit depth and running time. For example, we show how to generate shorter circuits by plugging in the best available third party synthesis algorithm at a given hierarchy level. Composability enables portability across chip architectures, which is missing from similar approaches.
QFAST is integrated with Qiskit and available at github.com/bqskit.

Akel Hashim, Ravi Naik, Alexis Morvan, Jean-Loup Ville, Brad Mitchell, John Mark Kreikebaum, Marc Davis, Ethan Smith, Costin Iancu, Kevin O Brien, Ian Hincks, Joel Wallman, Joseph V Emerson, David Ivan Santiago, Irfan Siddiqi, Scalable Quantum Computing on a Noisy Superconducting Quantum Processor via Randomized Compiling, Bulletin of the American Physical Society, 2021,

Coherent errors in quantum hardware severely limit the performance of quantum algorithms in an unpredictable manner, and mitigating their impact is necessary for realizing reliable, large-scale quantum computations. Randomized compiling achieves this goal by converting coherent errors into stochastic noise, dramatically reducing unpredictable errors in quantum algorithms and enabling accurate predictions of aggregate performance via cycle benchmarking estimates. In this work, we demonstrate significant performance gains under randomized compiling for both the four-qubit quantum Fourier transform algorithm and for random circuits of variable depth on a superconducting quantum processor. We also validate solution accuracy using experimentally-measured error rates. Our results demonstrate that randomized compiling can be utilized to maximally-leverage and predict the capabilities of modern-day noisy quantum processors, paving the way forward for scalable quantum computing.