I am performing research in the areas of programming models and code optimization for large-scale parallel systems. I tend to favor simple and practical designs. Over the years I've been involved in multiple projects.
- Big Data Support on HPC Systems. (Intel Parallel Computing Center): There exists an incentive in the HPC community to run commercial data analytics frameworks (Hadoop, Spark) on capability systems. Adoption so far has been hampered by design decisions that are commercial data-center specific, e.g. local disk available. We plan to explore R&D for technologies to enable Spark to run on large-scale HPC systems. The IPCC is part of our larger research agenda and focuses on improving the interaction between Spark and the Lustre parallel global file system on production systems at NERSC. First notable accomplishment is that we show how to scale Spark from 100 cores to 10.000 cores on production HPC systems.
- CORVETTE (Correctness, Verification and Testing of Parallel Programs): I'm interested in tools that assist developers in bug finding or program state exploration for debugging or optimization purposes. The focus is on developing a scalable dynamic program analysis framework for the HPC programming models
- Berkeley UPC I wrote a lot of the compiler code and moved on to code optimizations, where I've been successful when using a combination of static and dynamic program analyses to guide communication and task scheduling optimizations. To guide optimizations, I'd rather use "qualitative" performance models over the more traditional approaches: track derivatives and preserve ordering rather than predict absolute values.
- THOR (Throughput Oriented Runtime): In this project, we are trying to build a software overlay network to perform on-the-fly communication optimizations for large-scale systems. There are three dimensions of control: concurrency, order, and granularity. We have developed mechanisms to regulate the concurrency of communication operations, avoid congestion, scheduling/reordering for improved throughput, etc.
The individual project pages contain more information: slides, posters, software, fun ...
Nicholas Chaimov, Khaled Z. Ibrahim, Samuel Williams, Costin Iancu, "Reaching Bandwidth Saturation Using Transparent Injection Parallelization", International Journal of High Performance Computing Applications (IJHPCA), November 2016, doi: 10.1177/1094342016672720
Nicholas Chaimov, Khaled Ibrahim, Samuel Williams, Costin Iancu, "Exploiting Communication Concurrency on High Performance Computing Systems", IJHPCA, April 17, 2015,
- Download File: thorserv2.pdf (pdf: 1.7 MB)
Khaled Z. Ibrahim, Steven Hofmeyr, Costin Iancu, "The Case for Partitioning Virtual Machines on Manycore Architectures", IEEE TPDS, April 17, 2014,
- Download File: TPDSsubmit.pdf (pdf: 5.6 MB)
S Hofmeyr, J Colmenares, J Kubiatowicz, C Iancu, "Juggle: Addressing Extrinsic Load Imbalances in SPMD Applications on Multicore Computer", Cluster Computing, 2012,
Anastasiia Butko, George Michelogiannakis, Samuel Williams, Costin Iancu, David Donofrio, John Shalf, Jonathan Carter, Irfan Siddiqi, "Understanding Quantum Control Processor Capabilities and Limitations through Circuit Characterization", IEEE Conference on Rebooting Computing (ICRC), December 2020,
- Download File: ICRC20-QUASAR-final.pdf (pdf: 1.1 MB)
Ethan H. Smith, Marc G. Davis, Jeffery M. Larson, Costin Iancu, "LEAP: Scaling Numerical Optimization Based Synthesis Using an Incremental Approach", International Workshop of Quantum Computing Software at Supercomputing, November 2020,
- Download File: LEAP-Heuristics-for-Near-Optimal-Synthesis-Targeting-NISQ-Devices.pdf (pdf: 125 KB)
Marc G. Davis, Ethan Smith, Ana Tudor, Koushik Sen, Irfan Siddiqi, Costin Iancu, "Towards Optimal Topology Aware Quantum Circuit Synthesis", 2020 IEEE International Conference on Quantum Computing and Engineering (QCE), Denver, CO, USA, IEEE, October 12, 2020, doi: 10.1109/QCE49297.2020.00036
We present an algorithm for compiling arbitrary unitaries into a sequence of gates native to a quantum processor. As CNOT gates are error-prone for the foreseeable Noisy-Intermediate-Scale Quantum devices era, our A* inspired algorithm minimizes their count while accounting for connectivity. We discuss the formulation of synthesis as a search problem as well as an algorithm to find solutions. For a workload of circuits with complexity appropriate for the NISQ era, we produce solutions well within the best upper bounds published in literature and match or exceed hand tuned implementations, as well as other existing synthesis alternatives. In particular, when comparing against state-of-the-art available synthesis packages we show 2.4× average (up to 5.3×) reduction in CNOT count. We also show how to re-target the algorithm for a different chip topology and native gate set while obtaining similar quality results. We believe that tools like ours can facilitate algorithmic exploration and guide gate set discovery for quantum processor designers, as well as being useful for optimization in the quantum compilation tool-chain.
Marc Grau Davis, Ethan Smith, Ana Tudor, Koushik Sen, Irfan Siddiqi, Costin Iancu, "Heuristics for Quantum Compiling with a Continuous Gate Set", 3rd International Workshop on Quantum Compilation as part of the International Conference On Computer Aided Design 2019, December 5, 2019,
We present an algorithm for compiling arbitrary unitaries into a sequence of gates native to a quantum processor. As accurate CNOT gates are hard for the foreseeable Noisy- Intermediate-Scale Quantum devices era, our A* inspired algorithm attempts to minimize their count, while accounting for connectivity. We discuss the search strategy together with metrics to expand the solution frontier. For a workload of circuits with complexity appropriate for the NISQ era, we produce solutions well within the best upper bounds published in literature and match or exceed hand tuned implementations, as well as other existing synthesis alternatives. In particular, when comparing against state-of-the-art available synthesis packages we show 2.4x average (up to 5.3x) reduction in CNOT count. We also show how to re-target the algorithm for a different chip topology and native gate set, while obtaining similar quality results. We believe that empirical tools like ours can facilitate algorithmic exploration, gate set discovery for quantum processor designers, as well as providing useful optimization blocks within the quantum compilation tool-chain.
Wim Lavrijsen, Costin Iancu, Wibe Albert de Jong, Xin Chen, Karsten Schwan, "Exploiting Variability for Energy Optimization of Load Balanced Parallel Programs", EuroSys 2016, February 5, 2016,
- Download File: iancu-eurosys16.pdf (pdf: 2.6 MB)
Nicholas Chaimov, Allen Malony, Shane Canon, Costin Iancu, Khaled Ibrahim, Jay Srinivasan, "Scaling Spark on HPC Systems", High Performance and Distributed Computing (HPDC), February 5, 2016,
- Download File: spark-on-cray.pdf (pdf: 2.6 MB)
Xuehai Qian, Koushik Sen, Paul Hargrove, Costin Iancu, "SReplay: Deterministic Sub-Group Replay for One-Sided Communication", International Conference on Supercomputing (ICS), 2016, February 5, 2016,
- Download File: iancu-ics16.pdf (pdf: 411 KB)
Cuong Nguyen, Cindy Rubio-Gonzalez, Benjamin Mehne, Koushik Sen, Costin Iancu, James Demmel, William Kahan, Wim Lavrijsen, David H. Bailey, David Hough, "Floating-point precision tuning using blame analysis", 38th International Conference on Software Engineering (ICSE 2016), January 20, 2016,
- Download File: icse16-blame-analysis.pdf (pdf: 1 MB)
Costin Iancu, Nicholas Chaimov, Khaled Z. Ibrahim, Samuel Williams, "Exploiting Communication Concurrency on High Performance Computing Systems", Programming Models and Applications for Multicores and Manycores (PMAM), February 2015,
- Download File: pmam15-servers.pdf (pdf: 1.2 MB)
Milind Chabbi, Wim Lavrijsen, Wibe de Jong, Koushik Sen, John Mellor Crummey, Costin Iancu, "Barrier Elision for Production Parallel Programs", PPOPP 2015, February 5, 2015,
- Download File: nwbar.pdf (pdf: 663 KB)
M Chabbi, W Lavrijsen, W De Jong, K Sen, J Mellor-Crummey, C Iancu, "Barrier elision for production parallel programs", Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP, January 1, 2015, 2015-Jan:109--119, doi: 10.1145/2688500.2688502
Khaled Ibrahim, Paul Hargrove, Costin Iancu, Katherine Yelick, "A Performance Evaluation of One-Sided and Two-Sided Communication Paradigms on Relaxed-Ordering Interconnect", IPDPS 2014, April 17, 2014,
Cindy Rubio-Gonzalez, Cuong Nguyen, Hong Diep Nguyen, James Demmel, William Kahan, Koushik Sen, David H. Bailey, Costin Iancu, David Hough, "Precimonious: Tuning Assistant for Floating-Point Precision", Supercomputing 2013, July 19, 2013,
- Download File: sc13.pdf (pdf: 555 KB)
Chang-Seo Park, Koushik Sen, Costin Iancu, "Scaling Data Race Detection for Partitioned Global Address Space Programs", International Supercomputing Conference (ICS) 2013, 2013,
- Download File: thrille-exp4.pdf (pdf: 744 KB)
Miao Luo, Dhabaleswar K. Panda, Khaled Z. Ibrahim, Costin Iancu, "Congestion avoidance on manycore high performance computing systems", International Conference on Supercomputing (ICS), 2012,
- Download File: ics105-luo.pdf (pdf: 1.8 MB)
Chang-Seo Park, Koushik Sen, Paul Hargrove, Costin Iancu, "Efficient data race detection for distributed memory parallel programs", Supercomputing (SC), 2011,
- Download File: upcthrille.pdf (pdf: 457 KB)
Seung-Jai Min, Costin Iancu and Katherine Yelick, "Hierarchical Work Stealing on Manycore Clusters", Fifth Conference on Partitioned Global Address Space Programming Models (PGAS11), 2011,
- Download File: task.pdf (pdf: 703 KB)
KZ Ibrahim, S Hofmeyr, C Iancu, E Roman, "Optimized pre-copy live migration for memory intensive applications", Proceedings of 2011 SC - International Conference for High Performance Computing, Networking, Storage and Analysis, 2011, doi: 10.1145/2063384.2063437
- Download File: VMMigrationsubmit.pdf (pdf: 791 KB)
S Hofmeyr, JA Colmenares, C Iancu, J Kubiatowicz, "Juggle: Proactive load balancing on multicore computers", Proceedings of the IEEE International Symposium on High Performance Distributed Computing, 2011, 3--14, doi: 10.1145/1996130.1996134
KZ Ibrahim, S Hofmeyr, C Iancu, "Characterizing the performance of parallel applications on multi-socket virtual machines", Proceedings - 11th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGrid 2011, 2011, 1--12, doi: 10.1109/CCGrid.2011.50
- Download File: ccgrid11submit.pdf (pdf: 427 KB)
Filip Blagojevic, Paul Hargrove, Costin Iancu, and Katherine Yelick, "Hybrid PGAS Runtime Support for Multicore Nodes", Fourth Conference on Partitioned Global Address Space Programming Model (PGAS10), October 2010,
- Download File: paper2.pdf (pdf: 1 MB)
Filip Blagojevic, Costin Iancu, Katherine A. Yelick, Matthew Curtis-Maury, Dimitrios S. Nikolopoulos, Benjamin Rose, "Scheduling dynamic parallelism on accelerators.", Computing Frontiers (CF), June 6, 2009,
Costin Iancu, Wei Chen, Katherine A. Yelick, "Performance portable optimizations for loops containing communication operations", International Conference on Supercomputing (ICS), June 6, 2008,
- Download File: portableperformance2.pdf (pdf: 333 KB)
Katherine Yelick, Dan Bonachea, Wei-Yu Chen, Phillip Colella, Kaushik Datta, Jason Duell, Susan L. Graham, Paul Hargrove, Paul Hilfinger, Parry Husbands, Costin Iancu, Amir Kamil, Rajesh Nishtala, Jimmy Su, Michael Welcome, Tong Wen, "Productivity and Performance Using Partitioned Global Address Space Languages", Parallel Symbolic Computation (PASCO'07), July 2007, doi: 10.1145/1278177.1278183
Partitioned Global Address Space (PGAS) languages combine the programming convenience of shared memory with the locality and performance control of message passing. One such language, Unified Parallel C (UPC) is an extension of ISO C defined by a consortium that boasts multiple proprietary and open source compilers. Another PGAS language, Titanium, is a dialect of Java T M designed for high performance scientific computation. In this paper we describe some of the highlights of two related projects, the Titanium project centered at U.C. Berkeley and the UPC project centered at Lawrence Berkeley National Laboratory. Both compilers use a source-to-source strategy that translates the parallel languages to C with calls to a communication layer called GASNet. The result is portable highperformance compilers that run on a large variety of shared and distributed memory multiprocessors. Both projects combine compiler, runtime, and application efforts to demonstrate some of the performance and productivity advantages to these languages.
Costin Iancu, Erich Strohmaier, "Optimizing communication overlap for high-speed networks", Principles and Practice of Parallel Programming (PPoPP), 2007,
- Download File: paper58-iancu.pdf (pdf: 469 KB)
Wei-Yu Chen, Dan Bonachea, Costin Iancu, Katherine A. Yelick, "Automatic nonblocking communication for partitioned global address space programs", International Conference on Supercomputing (ICS), 2007, doi: 10.1145/1274971.1274995
Overlapping communication with computation is an important optimization on current cluster architectures; its importance is likely to increase as the doubling of processing power far outpaces any improvements in communication latency. PGAS languages offer unique opportunities for communication overlap, because their one-sided communication model enables low overhead data transfer. Recent results have shown the value of hiding latency by manually applying language-level nonblocking data transfer routines, but this process can be both tedious and error-prone. In this paper, we present a runtime framework that automatically schedules the data transfers to achieve overlap. The optimization framework is entirely transparent to the user, and aggressively reorders and aggregates both remote puts and gets. We preserve correctness via runtime conflict checks and temporary buffers, using several techniques to lower the overhead. Experimental results on application benchmarks suggest that our framework can be very effective at hiding communication latency on clusters, improving performance over the blocking code by an average of 16% for some of the NAS Parallel Benchmarks, 48% for GUPS, and over 25% for a multi-block fluid dynamics solver. While the system is not yet as effective as aggressive manual optimization, it increases programmers' productivity by freeing them from the details of communication management.
L. Oliker, A. Canning, J. Carter, C. Iancu, M. Lijewski, S. Kamil, J. Shalf, H. Shan, E. Strohmaier, S. Ethier, T. Goodale, "Scientific application performance on candidate PetaScale platforms", Proceedings - 21st International Parallel and Distributed Processing Symposium, IPDPS 2007; Abstracts and CD-ROM, 2007, doi: 10.1109/IPDPS.2007.370259
- Download File: ipdps07-petascale.pdf (pdf: 4.4 MB)
Wei-Yu Chen, Costin Iancu, Katherine A. Yelick, "Communication Optimizations for Fine-Grained UPC Applications", IEEE Parallel Architectures and Compilation Techniques, 2005,
Costin Iancu, Parry Husbands, Paul Hargrove, "HUNTing the Overlap", IEEE Parallel Architectures and Compilation Techniques (PACT), 2005,
Costin Iancu, Parry Husbands, Wei Chen, "Message Strip-Mining Heuristics for High Speed Networks", VECPAR, 2004,
Wei Chen, Dan Bonachea, Jason Duell, Parry Husbands, Costin Iancu, Katherine Yelick, "A Performance Analysis of the Berkeley UPC Compiler", Proceedings of the International Conference on Supercomputing (ICS) 2003, December 1, 2003, doi: 10.1145/782814.782825
Unified Parallel C (UPC) is a parallel language that uses a Single Program Multiple Data (SPMD) model of parallelism within a global address space. The global address space is used to simplify programming, especially on applications with irregular data structures that lead to fine-grained sharing between threads. Recent results have shown that the performance of UPC using a commercial compiler is comparable to that of MPI . In this paper we describe a portable open source compiler for UPC. Our goal is to achieve a similar performance while enabling easy porting of the compiler and runtime, and also provide a framework that allows for extensive optimizations. We identify some of the challenges in compiling UPC and use a combination of micro-benchmarks and application kernels to show that our compiler has low overhead for basic operations on shared data and is competitive, and sometimes faster than, the commercial HP compiler. We also investigate several communication optimizations, and show significant benefits by hand-optimizing the generated code.
Parry Husbands, Costin Iancu, Katherine A. Yelick, "A performance analysis of the Berkeley UPC compiler", International Conference on Supercomputing (ICS), 2003,
Christian Bell, Dan Bonachea, Yannick Cote, Jason Duell, Paul Hargrove, Parry Husbands, Costin Iancu, Michael L. Welcome, Katherine A. Yelick, "An Evaluation of Current High-Performance Networks", International Parallel & Distributed Processing Symposium (IPDPS), April 2003, doi: 10.1109/IPDPS.2003.1213106
High-end supercomputers are increasingly built out of commodity components, and lack tight integration between the processor and network. This often results in inefficiencies in the communication subsystem, such as high software overheads and/or message latencies. In this paper we use a set of microbenchmarks to quantify the cost of this commoditization, measuring software overhead, latency, and bandwidth on five contemporary supercomputing networks. We compare the performance of the ubiquitous MPI layer to that of lower-level communication layers, and quantify the advantages of the latter for small message performance. We also provide data on the potential for various communication-related optimizations, such as overlapping communication with computation or other communication. Finally, we determine the minimum size needed for a message to be considered 'large' (i.e., bandwidth-bound) on these platforms, and provide historical data on the software overheads of a number of supercomputers over the past decade.
Costin Iancu, Anurag Acharya, "A Comparison of Feedback Based and Fair Queuing Mechanisms for Handling Unresponsive Traffic", IEEE Symposium on Computers and Communications (ISCC), 2001,
Costin Iancu, Anurag Acharya, "An evaluation of search tree techniques in the presence of caches", International Symposium on Performance Analysis of Systems and Software, June 6, 2001,
Andrew Duncan, Bogdan Cocosel, Costin Iancu, Holger Kienle, Radu Rugina, Urs Holzle, "OSUIF: SUIF 2.0 With Objects", 2nd SUIF Compiler Workshop, June 6, 1997,
L. Oliker, A. Canning, J. Carter, C. Iancu, M. Lijewski, S. Kamil, J. Shalf, H. Shan, E. Strohmaier, S. Ethier, T. Goodale, "Performance Characteristics of Potential Petascale Scientific Applications", Petascale Computing: Algorithms and Applications. Chapman & Hall/CRC Computational Science Series (Hardcover), edited by David A. Bader, ( 2007)
Yili Zheng, Filip Blagojevic, Dan Bonachea, Paul H. Hargrove, Steven Hofmeyr, Costin Iancu, Seung-Jai Min, Katherine Yelick, Getting Multicore Performance with UPC, SIAM Conference on Parallel Processing for Scientific Computing, February 2010,
- Download File: Multicore-Performance-with-UPC-SIAMPP10-Zheng.pdf (pdf: 933 KB)
Yili Zheng, Costin Iancu, Paul H. Hargrove, Seung-Jai Min, Katherine Yelick, Extending Unified Parallel C for GPU Computing, SIAM Conference on Parallel Processing for Scientific Computing, February 24, 2010,
Cindy Rubio-Gonz ́alez, Cuong Nguyen, James Demmel, William Kahan, Koushik Sen, Wim Lavrijsen, Costin Iancu, "Floating Point Precision Tuning Using Blame Analysis", LBNL TR, April 17, 2015,
- Download File: main2.pdf (pdf: 830 KB)
Xuehai Qian, Koushik Sen, Paul Hargrove, Costin Iancu, "OPR: Partial Deterministic Record and Replay for One-Sided Communication", LBNL TR, April 17, 2015,
- Download File: main3.pdf (pdf: 295 KB)
UPC Consortium, "UPC Language and Library Specifications, Version 1.3", Lawrence Berkeley National Laboratory Technical Report, November 16, 2013, LBNL 6623E, doi: 10.2172/1134233
UPC is an explicitly parallel extension to the ISO C 99 Standard. UPC follows the partitioned global address space programming model. This document is the formal specification for the UPC language and library syntax and semantics, and supersedes prior specification version 1.2 (LBNL-59208).
UPC Consortium, "UPC Language Specifications, v1.2", Lawrence Berkeley National Laboratory Technical Report, May 31, 2005, LBNL 59208, doi: 10.2172/862127
UPC is an explicitly parallel extension to the ISO C 99 Standard. UPC follows the partitioned global address space programming model. This document is the formal specification for the UPC language syntax and semantics.
Michael Schmitt. Anurag Acharya, Max Ibel, Costin Iancu, "Service Sockets: A Uniform User-Level Interface for Networking Applications", 2001,
Chang-Seo Park, Koushik Sen, Costin Iancu, "Scaling Data Race Detection for Partitioned Global Address Space Programs", Principles and Practice of Parallel Programming (PPoPP 2013), March 4, 2013,
- Download File: thrille-exp5.pdf (pdf: 200 KB)
Costin Iancu, Wei Chen, Katherine A. Yelick, "Performance Portable Optimizations for Loops Containing Communication Operations", IEEE PACT, 2007,
- Download File: portableperformance.pdf (pdf: 333 KB)
Christian Bell, Dan Bonachea, Wei Chen, Jason Duell, Paul Hargrove, Parry Husbands, Costin Iancu, Wei Tu, Mike Welcome, Kathy Yelick, "GASNet 2 - An Alternative High-Performance Communication Interface", ACM/IEEE Conference on Supercomputing (SC'04) Poster Session, November 2004,
Christian Bell, Dan Bonachea, Wei Chen, Jason Duell, Paul Hargrove, Parry Husbands, Costin Iancu, Mike Welcome, Kathy Yelick, "GASNet: Project Overview", ACM/IEEE Conference on Supercomputing (SC'03) Poster Session, November 2003,
Christian Bell, Dan Bonachea, Wei Chen, Jason Duell, Paul Hargrove, Parry Husbands, Costin Iancu, Mike Welcome, Kathy Yelick, "GASNet: Project Overview", ACM/IEEE Conference on Supercomputing (SC'02) Poster Session, November 2002,
James Demmel, Costin Iancu, Kouhsik Sen, "Corvette Progress Report 2015", April 1, 2015,
- Download File: corvette-progress.pdf (pdf: 198 KB)
Akel Hashim, Ravi Naik, Alexis Morvan, Jean-Loup Ville, Brad Mitchell, John Mark Kreikebaum, Marc Davis, Ethan Smith, Costin Iancu, Kevin O Brien, Ian Hincks, Joel Wallman, Joseph V Emerson, David Ivan Santiago, Irfan Siddiqi, Scalable Quantum Computing on a Noisy Superconducting Quantum Processor via Randomized Compiling, Bulletin of the American Physical Society, 2021,
Coherent errors in quantum hardware severely limit the performance of quantum algorithms in an unpredictable manner, and mitigating their impact is necessary for realizing reliable, large-scale quantum computations. Randomized compiling achieves this goal by converting coherent errors into stochastic noise, dramatically reducing unpredictable errors in quantum algorithms and enabling accurate predictions of aggregate performance via cycle benchmarking estimates. In this work, we demonstrate significant performance gains under randomized compiling for both the four-qubit quantum Fourier transform algorithm and for random circuits of variable depth on a superconducting quantum processor. We also validate solution accuracy using experimentally-measured error rates. Our results demonstrate that randomized compiling can be utilized to maximally-leverage and predict the capabilities of modern-day noisy quantum processors, paving the way forward for scalable quantum computing.
Ed Younis, Koushik Sen, Katherine Yelick, Costin Iancu, QFAST: Quantum Synthesis Using a Hierarchical Continuous Circuit Space, Bulletin of the American Physical Society, March 2021,
We present QFAST, a quantum synthesis tool designed to produce short circuits and to scale well in practice. Our contributions are: 1) a novel representation of circuits able to encode placement and topology; 2) a hierarchical approach with an iterative refinement formulation that combines "coarse-grained" fast optimization during circuit structure search with a good, but slower, optimization stage only in the final circuit instantiation. When compared against state-of-the-art techniques, although not always optimal, QFAST can reduce circuits for "time-dependent evolution" algorithms, as used by domain scientists, by 60x in depth. On typical circuits, it provides 4x better depth reduction than the widely used Qiskit and UniversalQ compilers. We also show the composability and tunability of our formulation in terms of circuit depth and running time. For example, we show how to generate shorter circuits by plugging in the best available third party synthesis algorithm at a given hierarchy level. Composability enables portability across chip architectures, which is missing from similar approaches.
QFAST is integrated with Qiskit and available at github.com/bqskit.