I am performing research in the areas of programming models and code optimization for large scale parallel systems. I tend to favor simple and practical designs. Over the years I've been involved in multiple projects.
- Big Data Support on HPC Systems. (Intel Parallel Computing Center): There exists incentive in the HPC community to run commercial data analytics frameworks (Hadoop, Spark) on capability systems. Adoption, so far has been hampered by design decisions that are commercial data-center specific, e.g. local disk available. We plan to explore R&D for technologies to enable Spark to run on large scale HPC systems. The IPCC is part of our larger research agenda and focuses on improving the interaction between Spark and the Lustre parallel global file system on production systems at NERSC. First notable accomplishment is that we show how to scale Spark from 100 cores to 10.000 cores on production HPC systems.
- CORVETTE (Correctness, Verification and Testing of Parallel Programs): I'm interested in tools that assist developers in bug finding or program state exploration for debugging or optimization purposes. The focus is on developing a scalable dynamic program analysis framework for the HPC programming models
- Berkeley UPC: I did write a lot of the compiler code and moved on to code optimizations, where I've been successful when using a combination of static and dynamic program analyses to guide communication and task scheduling optimizations. To guide optimizations, I'd rather use "qualitative" performance models over the more traditional approaches: track derivatives and preserve ordering rather than predict absolute values.
- THOR (Throughput Oriented Runtime): In this project we are trying to build a software overlay network to perform on-the-fly communication optimizations for large scale systems. There are three dimensions of control: concurrency, order and granularity. We have developed mechanisms to regulate the concurrency of communication operations, avoid congestion, scheduling/reordering for improved throughput etc.
The individual project pages contain more information: slides, posters, software, fun ...
Nicholas Chaimov, Khaled Z. Ibrahim, Samuel Williams, Costin Iancu, "Reaching Bandwidth Saturation Using Transparent Injection Parallelization", International Journal of High Performance Computing Applications (IJHPCA), November 2016, doi: 10.1177/1094342016672720
Nicholas Chaimov, Khaled Ibrahim, Samuel Williams, Costin Iancu, "Exploiting Communication Concurrency on High Performance Computing Systems", IJHPCA, April 17, 2015,
- Download File: thorserv2.pdf (pdf: 1.7 MB)
Khaled Z. Ibrahim, Steven Hofmeyr, Costin Iancu, "The Case for Partitioning Virtual Machines on Manycore Architectures", IEEE TPDS, April 17, 2014,
- Download File: TPDSsubmit.pdf (pdf: 5.6 MB)
S Hofmeyr, J Colmenares, J Kubiatowicz, C Iancu, "Juggle: Addressing Extrinsic Load Imbalances in SPMD Applications on Multicore Computer", Cluster Computing, 2012,
Wim Lavrijsen, Costin Iancu, Wibe Albert de Jong, Xin Chen, Karsten Schwan, "Exploiting Variability for Energy Optimization of Load Balanced Parallel Programs", EuroSys 2016, February 5, 2016,
- Download File: iancu-eurosys16.pdf (pdf: 2.6 MB)
Nicholas Chaimov, Allen Malony, Shane Canon, Costin Iancu, Khaled Ibrahim, Jay Srinivasan, "Scaling Spark on HPC Systems", High Performance and Distributed Computing (HPDC), February 5, 2016,
- Download File: spark-on-cray.pdf (pdf: 2.6 MB)
Xuehai Qian, Koushik Sen, Paul Hargrove, Costin Iancu, "SReplay: Deterministic Sub-Group Replay for One-Sided Communication", International Conference on Supercomputing (ICS), 2016, February 5, 2016,
- Download File: iancu-ics16.pdf (pdf: 411 KB)
Cuong Nguyen, Cindy Rubio-Gonzalez, Benjamin Mehne, Koushik Sen, Costin Iancu, James Demmel, William Kahan, Wim Lavrijsen, David H. Bailey, David Hough, "Floating-point precision tuning using blame analysis", 38th International Conference on Software Engineering (ICSE 2016), January 20, 2016,
- Download File: icse16-blame-analysis.pdf (pdf: 1 MB)
Costin Iancu, Nicholas Chaimov, Khaled Z. Ibrahim, Samuel Williams, "Exploiting Communication Concurrency on High Performance Computing Systems", Programming Models and Applications for Multicores and Manycores (PMAM), February 2015,
- Download File: pmam15-servers.pdf (pdf: 1.2 MB)
Milind Chabbi, Wim Lavrijsen, Wibe de Jong, Koushik Sen, John Mellor Crummey, Costin Iancu, "Barrier Elision for Production Parallel Programs", PPOPP 2015, February 5, 2015,
- Download File: nwbar.pdf (pdf: 663 KB)
M Chabbi, W Lavrijsen, W De Jong, K Sen, J Mellor-Crummey, C Iancu, "Barrier elision for production parallel programs", Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP, January 1, 2015, 2015-Jan:109--119, doi: 10.1145/2688500.2688502
Khaled Ibrahim, Paul Hargrove, Costin Iancu, Katherine Yelick, "A Performance Evaluation of One-Sided and Two-Sided Communication Paradigms on Relaxed-Ordering Interconnect", IPDPS 2014, April 17, 2014,
Cindy Rubio-Gonzalez, Cuong Nguyen, Hong Diep Nguyen, James Demmel, William Kahan, Koushik Sen, David H. Bailey, Costin Iancu, David Hough, "Precimonious: Tuning Assistant for Floating-Point Precision", Supercomputing 2013, July 19, 2013,
- Download File: sc13.pdf (pdf: 555 KB)
Chang-Seo Park, Koushik Sen, Costin Iancu, "Scaling Data Race Detection for Partitioned Global Address Space Programs", International Supercomputing Conference (ICS) 2013, 2013,
- Download File: thrille-exp4.pdf (pdf: 744 KB)
Miao Luo, Dhabaleswar K. Panda, Khaled Z. Ibrahim, Costin Iancu, "Congestion avoidance on manycore high performance computing systems", International Conference on Supercomputing (ICS), 2012,
- Download File: ics105-luo.pdf (pdf: 1.8 MB)
Chang-Seo Park, Koushik Sen, Paul Hargrove, Costin Iancu, "Efficient data race detection for distributed memory parallel programs", Supercomputing (SC), 2011,
- Download File: upcthrille.pdf (pdf: 457 KB)
Seung-Jai Min, Costin Iancu and Katherine Yelick, "Hierarchical Work Stealing on Manycore Clusters", Fifth Conference on Partitioned Global Address Space Programming Models (PGAS11), 2011,
- Download File: task.pdf (pdf: 703 KB)
KZ Ibrahim, S Hofmeyr, C Iancu, E Roman, "Optimized pre-copy live migration for memory intensive applications", Proceedings of 2011 SC - International Conference for High Performance Computing, Networking, Storage and Analysis, 2011, doi: 10.1145/2063384.2063437
- Download File: VMMigrationsubmit.pdf (pdf: 791 KB)
S Hofmeyr, JA Colmenares, C Iancu, J Kubiatowicz, "Juggle: Proactive load balancing on multicore computers", Proceedings of the IEEE International Symposium on High Performance Distributed Computing, 2011, 3--14, doi: 10.1145/1996130.1996134
KZ Ibrahim, S Hofmeyr, C Iancu, "Characterizing the performance of parallel applications on multi-socket virtual machines", Proceedings - 11th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGrid 2011, 2011, 1--12, doi: 10.1109/CCGrid.2011.50
- Download File: ccgrid11submit.pdf (pdf: 427 KB)
Filip Blagojevic, Paul Hargrove, Costin Iancu, and Katherine Yelick, "Hybrid PGAS Runtime Support for Multicore Nodes", Fourth Conference on Partitioned Global Address Space Programming Model (PGAS10), October 2010,
- Download File: paper2.pdf (pdf: 1 MB)
Filip Blagojevic, Costin Iancu, Katherine A. Yelick, Matthew Curtis-Maury, Dimitrios S. Nikolopoulos, Benjamin Rose, "Scheduling dynamic parallelism on accelerators.", Computing Frontiers (CF), June 6, 2009,
Costin Iancu, Wei Chen, Katherine A. Yelick, "Performance portable optimizations for loops containing communication operations", International Conference on Supercomputing (ICS), June 6, 2008,
- Download File: portableperformance2.pdf (pdf: 333 KB)
Katherine Yelick, Dan Bonachea, Wei-Yu Chen, Phillip Colella, Kaushik Datta, Jason Duell, Susan L. Graham, Paul Hargrove, Paul Hilfinger, Parry Husbands, Costin Iancu, Amir Kamil, Rajesh Nishtala, Jimmy Su, Michael Welcome, Tong Wen, "Productivity and Performance Using Partitioned Global Address Space Languages", Parallel Symbolic Computation (PASCO'07), July 2007, doi: 10.1145/1278177.1278183
Partitioned Global Address Space (PGAS) languages combine the programming convenience of shared memory with the locality and performance control of message passing. One such language, Unified Parallel C (UPC) is an extension of ISO C defined by a consortium that boasts multiple proprietary and open source compilers. Another PGAS language, Titanium, is a dialect of Java T M designed for high performance scientific computation. In this paper we describe some of the highlights of two related projects, the Titanium project centered at U.C. Berkeley and the UPC project centered at Lawrence Berkeley National Laboratory. Both compilers use a source-to-source strategy that translates the parallel languages to C with calls to a communication layer called GASNet. The result is portable highperformance compilers that run on a large variety of shared and distributed memory multiprocessors. Both projects combine compiler, runtime, and application efforts to demonstrate some of the performance and productivity advantages to these languages.
Costin Iancu, Erich Strohmaier, "Optimizing communication overlap for high-speed networks", Principles and Practice of Parallel Programming (PPoPP), 2007,
- Download File: paper58-iancu.pdf (pdf: 469 KB)
Wei-Yu Chen, Dan Bonachea, Costin Iancu, Katherine A. Yelick, "Automatic nonblocking communication for partitioned global address space programs", International Conference on Supercomputing (ICS), 2007, doi: 10.1145/1274971.1274995
Overlapping communication with computation is an important optimization on current cluster architectures; its importance is likely to increase as the doubling of processing power far outpaces any improvements in communication latency. PGAS languages offer unique opportunities for communication overlap, because their one-sided communication model enables low overhead data transfer. Recent results have shown the value of hiding latency by manually applying language-level nonblocking data transfer routines, but this process can be both tedious and error-prone. In this paper, we present a runtime framework that automatically schedules the data transfers to achieve overlap. The optimization framework is entirely transparent to the user, and aggressively reorders and aggregates both remote puts and gets. We preserve correctness via runtime conflict checks and temporary buffers, using several techniques to lower the overhead. Experimental results on application benchmarks suggest that our framework can be very effective at hiding communication latency on clusters, improving performance over the blocking code by an average of 16% for some of the NAS Parallel Benchmarks, 48% for GUPS, and over 25% for a multi-block fluid dynamics solver. While the system is not yet as effective as aggressive manual optimization, it increases programmers' productivity by freeing them from the details of communication management.
L. Oliker, A. Canning, J. Carter, C. Iancu, M. Lijewski, S. Kamil, J. Shalf, H. Shan, E. Strohmaier, S. Ethier, T. Goodale, "Scientific application performance on candidate PetaScale platforms", Proceedings - 21st International Parallel and Distributed Processing Symposium, IPDPS 2007; Abstracts and CD-ROM, 2007, doi: 10.1109/IPDPS.2007.370259
- Download File: ipdps07-petascale.pdf (pdf: 4.4 MB)
Wei-Yu Chen, Costin Iancu, Katherine A. Yelick, "Communication Optimizations for Fine-Grained UPC Applications", IEEE Parallel Architectures and Compilation Techniques, 2005,
Costin Iancu, Parry Husbands, Paul Hargrove, "HUNTing the Overlap", IEEE Parallel Architectures and Compilation Techniques (PACT), 2005,
Costin Iancu, Parry Husbands, Wei Chen, "Message Strip-Mining Heuristics for High Speed Networks", VECPAR, 2004,
Wei Chen, Dan Bonachea, Jason Duell, Parry Husbands, Costin Iancu, Katherine Yelick, "A Performance Analysis of the Berkeley UPC Compiler", Proceedings of the International Conference on Supercomputing (ICS) 2003, December 1, 2003, doi: 10.1145/782814.782825
Unified Parallel C (UPC) is a parallel language that uses a Single Program Multiple Data (SPMD) model of parallelism within a global address space. The global address space is used to simplify programming, especially on applications with irregular data structures that lead to fine-grained sharing between threads. Recent results have shown that the performance of UPC using a commercial compiler is comparable to that of MPI . In this paper we describe a portable open source compiler for UPC. Our goal is to achieve a similar performance while enabling easy porting of the compiler and runtime, and also provide a framework that allows for extensive optimizations. We identify some of the challenges in compiling UPC and use a combination of micro-benchmarks and application kernels to show that our compiler has low overhead for basic operations on shared data and is competitive, and sometimes faster than, the commercial HP compiler. We also investigate several communication optimizations, and show significant benefits by hand-optimizing the generated code.
Christian Bell, Dan Bonachea, Yannick Cote, Jason Duell, Paul Hargrove, Parry Husbands, Costin Iancu, Michael L. Welcome, Katherine A. Yelick, "An Evaluation of Current High-Performance Networks", International Parallel & Distributed Processing Symposium (IPDPS), 2003, doi: 10.1109/IPDPS.2003.1213106
High-end supercomputers are increasingly built out of commodity components, and lack tight integration between the processor and network. This often results in inefficiencies in the communication subsystem, such as high software overheads and/or message latencies. In this paper we use a set of microbenchmarks to quantify the cost of this commoditization, measuring software overhead, latency, and bandwidth on five contemporary supercomputing networks. We compare the performance of the ubiquitous MPI layer to that of lower-level communication layers, and quantify the advantages of the latter for small message performance. We also provide data on the potential for various communication-related optimizations, such as overlapping communication with computation or other communication. Finally, we determine the minimum size needed for a message to be considered 'large' (i.e., bandwidth-bound) on these platforms, and provide historical data on the software overheads of a number of supercomputers over the past decade.
Parry Husbands, Costin Iancu, Katherine A. Yelick, "A performance analysis of the Berkeley UPC compiler", International Conference on Supercomputing (ICS), 2003,
Costin Iancu, Anurag Acharya, "A Comparison of Feedback Based and Fair Queuing Mechanisms for Handling Unresponsive Traffic", IEEE Symposium on Computers and Communications (ISCC), 2001,
Costin Iancu, Anurag Acharya, "An evaluation of search tree techniques in the presence of caches", International Symposium on Performance Analysis of Systems and Software, June 6, 2001,
Andrew Duncan, Bogdan Cocosel, Costin Iancu, Holger Kienle, Radu Rugina, Urs Holzle, "OSUIF: SUIF 2.0 With Objects", 2nd SUIF Compiler Workshop, June 6, 1997,
L. Oliker, A. Canning, J. Carter, C. Iancu, M. Lijewski, S. Kamil, J. Shalf, H. Shan, E. Strohmaier, S. Ethier, T. Goodale, "Performance Characteristics of Potential Petascale Scientific Applications", Petascale Computing: Algorithms and Applications. Chapman & Hall/CRC Computational Science Series (Hardcover), edited by David A. Bader, ( 2007)
Yili Zheng, Filip Blagojevic, Dan Bonachea, Paul H. Hargrove, Steven Hofmeyr, Costin Iancu, Seung-Jai Min, Katherine Yelick, Getting Multicore Performance with UPC, SIAM Conference on Parallel Processing for Scientific Computing, February 2010,
- Download File: Multicore-Performance-with-UPC-SIAMPP10-Zheng.pdf (pdf: 933 KB)
Yili Zheng, Costin Iancu, Paul H. Hargrove, Seung-Jai Min, Katherine Yelick, Extending Unified Parallel C for GPU Computing, SIAM Conference on Parallel Processing for Scientific Computing, February 24, 2010,
Cindy Rubio-Gonz ́alez, Cuong Nguyen, James Demmel, William Kahan, Koushik Sen, Wim Lavrijsen, Costin Iancu, "Floating Point Precision Tuning Using Blame Analysis", LBNL TR, April 17, 2015,
- Download File: main2.pdf (pdf: 830 KB)
Xuehai Qian, Koushik Sen, Paul Hargrove, Costin Iancu, "OPR: Partial Deterministic Record and Replay for One-Sided Communication", LBNL TR, April 17, 2015,
- Download File: main3.pdf (pdf: 295 KB)
UPC Consortium, "UPC Language and Library Specifications, Version 1.3", Lawrence Berkeley National Laboratory Technical Report, November 16, 2013, LBNL 6623E, doi: 10.2172/1134233
UPC is an explicitly parallel extension to the ISO C 99 Standard. UPC follows the partitioned global address space programming model. This document is the formal specification for the UPC language and library syntax and semantics, and supersedes prior specification version 1.2 (LBNL-59208).
UPC Consortium, "UPC Language Specifications, v1.2", Lawrence Berkeley National Laboratory Technical Report, May 31, 2005, LBNL 59208, doi: 10.2172/862127
UPC is an explicitly parallel extension to the ISO C 99 Standard. UPC follows the partitioned global address space programming model. This document is the formal specification for the UPC language syntax and semantics.
Michael Schmitt. Anurag Acharya, Max Ibel, Costin Iancu, "Service Sockets: A Uniform User-Level Interface for Networking Applications", 2001,
Chang-Seo Park, Koushik Sen, Costin Iancu, "Scaling Data Race Detection for Partitioned Global Address Space Programs", Principles and Practice of Parallel Programming (PPoPP 2013), March 4, 2013,
- Download File: thrille-exp5.pdf (pdf: 200 KB)
Costin Iancu, Wei Chen, Katherine A. Yelick, "Performance Portable Optimizations for Loops Containing Communication Operations", IEEE PACT, 2007,
- Download File: portableperformance.pdf (pdf: 333 KB)
Christian Bell, Dan Bonachea, Wei Chen, Jason Duell, Paul Hargrove, Parry Husbands, Costin Iancu, Wei Tu, Mike Welcome, Kathy Yelick, "GASNet 2 - An Alternative High-Performance Communication Interface", ACM/IEEE Conference on Supercomputing (SC'04) Poster Session, November 9, 2004,
Christian Bell, Dan Bonachea, Wei Chen, Jason Duell, Paul Hargrove, Parry Husbands, Costin Iancu, Mike Welcome, Kathy Yelick, "GASNet: Project Overview", ACM/IEEE Conference on Supercomputing (SC'03) Poster Session, November 2003,
Christian Bell, Dan Bonachea, Wei Chen, Jason Duell, Paul Hargrove, Parry Husbands, Costin Iancu, Mike Welcome, Kathy Yelick, "GASNet: Project Overview", ACM/IEEE Conference on Supercomputing (SC'02) Poster Session, November 2002,
James Demmel, Costin Iancu, Kouhsik Sen, "Corvette Progress Report 2015", April 1, 2015,
- Download File: corvette-progress.pdf (pdf: 198 KB)