Computer Architecture Group



Meriam Gay Bautista, Zhi Jackie Yao, Anastasiia Butko, Mariam Kiran, Mekena Metcalf, "Towards Automated Superconducting Circuit Calibration using Deep Reinforcement Learning", 2021 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), Tampa, FL, USA, IEEE, August 23, 2021, pp. 462-46, doi: 10.1109/ISVLSI51109.2021.00091

Douglas Doerfler, Farzad Fatollahi-Fard, Colin MacLean, Tan Nguyen, Samuel Williams, Nicholas J. Wright, Marco Siracusa, "Experiences Porting the SU3_Bench Microbenchmark to the Intel Arria 10 and Xilinx Alveo U280 FPGAs", International Workshop on OpenCL (iWOCL), April 2021, doi: 10.1145/3456669.3456671


Anastasiia Butko, George Michelogiannakis, Samuel Williams, Costin Iancu, David Donofrio, John Shalf, Jonathan Carter, Irfan Siddiqi, "Understanding Quantum Control Processor Capabilities and Limitations through Circuit Characterization", IEEE Conference on Rebooting Computing (ICRC), December 2020,

Tan Nguyen, Samuel Williams, Marco Siracusa, Colin MacLean, Douglas Doerfler, Nicholas J. Wright, "The Performance and Energy Efficiency Potential of FPGAs in Scientific Computing", (BEST PAPER) Performance Modeling, Benchmarking, and Simulation of High Performance Computer Systems (PMBS), November 2020,


Maximilian H Bremer, John D Bachan, Cy P Chan, "Semi-Static and Dynamic Load Balancing for Asynchronous Hurricane Storm Surge Simulations", 2018 Parallel Applications Workshop, Alternatives To MPI (PAW-ATM), November 16, 2018,

Cy P Chan, Bin Wang, John D Bachan, Jane Macfarlane, "Mobiliti: Scalable Transportation Simulation Using High-Performance Parallel Computing", 2018 IEEE International Conference on Intelligent Transportation Systems (ITSC), November 6, 2018,

Anastasiia Butko, Albert Chen, David Donofrio, Farzad Fatollahi-Fard, John Shalf, "Open2C: Open-source Generator for Exploration of Coherent Cache Memory Subsystems", MEMSYS '18, New York, NY, USA, ACM, 2018, 311--317, doi: 10.1145/3240302.3270314

Bin Wang, John D Bachan, Cy P Chan, "ExaGridPF: A parallel power flow solver for transmission and unbalanced distribution systems", 2018 IEEE Power & Energy Society Innovative Smart Grid Technologies Conference (ISGT), February 22, 2018,


Tan Nguyen, Pietro Cicotti, Eric Bylaska, Dan Quinlan, and Scott Baden, "Automatic Translation of MPI Source into a Latency-tolerant, Data-driven Form", Journal of Parallel and Distributed Computing, February 21, 2017,

T Nguyen, D Unat, W Zhang, A Almgren, N Farooqi, J Shalf, "Perilla: Metadata-Based Optimizations of an Asynchronous Runtime for Adaptive Mesh Refinement", International Conference for High Performance Computing, Networking, Storage and Analysis, SC, January 1, 2017, 945--956, doi: 10.1109/SC.2016.80


Cy Chan, John Bachan, Joseph Kenny, Jeremiah Wilke, Vincent Beckner, Ann Almgren, John Bell, "Topology-Aware Performance Optimization and Modeling of Adaptive Mesh Refinement Codes for Exascale", (BEST PAPER AWARD) COMHPC 2016 - SC16 Workshop on Communication Optimization in High Performance Computing, Salt Lake City, UT, November 18, 2016,

Best Paper Award

Bin Wang, Yubo Wang, Hamidreza Nazaripouya, Charlie Qiu, Chi-Cheng Chu, Rajit Gadh, "Predictive Scheduling Framework for Electric Vehicles with Uncertainties of User Behaviors", IEEE Internet of Things Journal, October 13, 2016, 4:52 - 63, doi: 10.1109/JIOT.2016.2617314

The randomness of user behaviors plays a significant role in electric vehicle (EV) scheduling problems, especially when the power supply for EV supply equipment (EVSE) is limited. Existing EV scheduling methods do not consider this limitation and assume charging session parameters, such as stay duration and energy demand values, are perfectly known, which is not realistic in practice. In this paper, based on real-world implementations of networked EVSEs on University of California at Los Angeles campus, we developed a predictive scheduling framework, including a predictive control paradigm and a kernel-based session parameter estimator. Specifically, the scheduling service periodically computes for cost-efficient solutions, considering the predicted session parameters, by the adaptive kernel-based estimator with improved estimation accuracies. We also consider the power sharing strategy of existing EVSEs and formulate the virtual load constraint to handle the future EV arrivals with unexpected energy demand. To validate the proposed framework, 20-fold cross validation is performed on the historical dataset of charging behaviors for over one-year period. The simulation results demonstrate that average unit energy cost per kWh can be reduced by 29.42% with the proposed scheduling framework and 66.71% by further integrating solar generations with the given capacity, after the initial infrastructure investment. The effectiveness of kernel-based estimator, virtual load constraint, and event-based control scheme are also discussed in detail.

Weiqun Zhang, Ann Almgren, Marcus Day, Tan Nguyen, John Shalf, Didem Unat, "BoxLib with Tiling: An AMR Software Framework", SIAM Journal on Scientific Computing, 2016,

Farzad Fatollahi-Fard, David Donofrio, George Michelogiannakis, John Shalf, "OpenSoC Fabric: On-Chip Network Generator", ISPASS 2016: International Symposium on Performance Analysis of Systems and Software, IEEE, April 2016, 194-203, doi: 10.1109/ISPASS.2016.7482094


D Unat, C Chan, W Zhang, S Williams, J Bachan, J Bell, J Shalf, "ExaSAT: An exascale co-design tool for performance modeling", International Journal of High Performance Computing Applications, January 2015, 29:209--232, doi: 10.1177/1094342014568690


Farzad Fatollahi-Fard, David Donofrio, George Michelogiannakis, John Shalf, "OpenSoC Fabric: On-Chip Network Generator", Proceedings of the Workshop on Network on Chip Architectures, ACM, December 2014, 45-50, LBNL LBNL-1005675, doi: 10.1145/2685342.2685351

J.A. Ang, R.F. Barrett, R.E. Benner, D. Burke, C. Chan, D. Donofrio, S.D. Hammond, K.S. Hemmert, S.M. Kelly, H. Le, V.J. Leung, D.R. Resnick, A.F. Rodrigues, J. Shalf, D. Stark, D. Unat, N.J. Wright, "Abstract Machine Models and Proxy Architectures for Exascale Computing", 2014 Hardware-Software Co-Design for High Performance Computing, November 17, 2014,

J.A. Ang, R.F. Barrett, R.E. Benner, D. Burke, C. Chan, D. Donofrio, S.D. Hammond, K.S. Hemmert, S.M. Kelly, H. Le, V.J. Leung, D.R. Resnick, A.F. Rodrigues, J. Shalf, D. Stark, N.J. Wright, D. Unat, "Abstract Machine Models and Proxy Architectures for Exascale Computing", Co-HPC 2014 (to appear), New Orleans, LA, USA, IEEE Computer Society, November 17, 2014,

To achieve Exascale computing, fundamental hardware architectures must change. The most significant consequence of this assertion is the impact on the scientific applications that run on current High Performance Computing (HPC) systems, many of which codify years of scientific domain knowledge and refinements for contemporary computer systems. In order to adapt to Exascale architectures, developers must be able to reason about new hardware and determine what programming models and algorithms will provide the best blend of performance and energy efficiency into the future. While many details of the Exascale architectures are undefined, an abstract machine model is designed to allow application developers to focus on the aspects of the machine that are important or relevant to performance and code structure. These models are intended as communication aids between application developers and hardware architects during the co-design process. We use the term proxy architecture to describe a parameterized version of an abstract machine model, with the parameters added to elucidate potential speeds and capacities of key hardware components. These more detailed architectural models are formulated to enable discussion between the developers of analytic models and simulators and computer hardware architects. They allow for application performance analysis and hardware optimization opportunities. In this report our goal is to provide the application development community with a set of models that can help software developers prepare for Exascale and, through the use of proxy architectures, we can enable a more concrete exploration of how well application codes map onto the future architectures.

George Michelogiannakis, John Shalf, "Variable-Width Datapath for On-Chip Network Static Power Reduction", 8th International Symposium on Networks-on-Chip (NOCS), September 2014,


George Michelogiannakis, John Shalf, Variable-Width Datapath for On-Chip Network Static Power Reduction, 8th International Symposium on Networks-on-Chip, September 2014,

Didem Unat, George Michelogiannakis, John Shalf, The Role of Modeling in Locality Optimizations, Modeling and simulation workshop (MODSIM), August 2014,

George Michelogiannakis, Collective Memory Transfers for Multi-Core Chips, International Conference on Supercomputing (ICS), June 2014,


George Michelogiannakis, Channel Reservation Protocol for Over-Subscribed Channels and Destinations, Conference on High Performance Computing Networking, Storage and Analysis, 2013,

George Michelogiannakis, Nan Jiang, Daniel U. Becker, William J. Dally, "Channel Reservation Protocol for Over-Subscribed Channels and Destinations", Conference on High Performance Computing Networking, Storage and Analysis, ACM, 2013,

George Michelogiannakis, Hardware Support for Collective Memory Transfers in Stencil Computations, Workshop on Optimizing Stencil Computations, October 2013,

George Michelogiannakis, Extending Summation Precision for Distributed Network Operations, 25th International Symposium on Computer Architecture and High Performance Computing, October 2013,

George Michelogiannakis, Xiaoye S. Li, David H. Bailey, John Shalf, "Extending Summation Precision for Network Reduction Operations", 25th International Symposium on Computer Architecture and High Performance Computing, IEEE Computer Society, October 2013,

Double precision summation is at the core of numerous important algorithms such as Newton-Krylov methods and other operations involving inner products, but the effectiveness of summation is limited by the accumulation of rounding errors, which are an increasing problem with the scaling of modern HPC systems and data sets. To reduce the impact of precision loss, researchers have proposed increased- and arbitrary-precision libraries that provide reproducible error or even bounded error accumulation for large sums, but do not guarantee an exact result. Such libraries can also increase computation time significantly. We propose big integer (BigInt) expansions of double precision variables that enable arbitrarily large summations without error and provide exact and reproducible results. This is feasible with performance comparable to that of double-precision floating point summation, by the inclusion of simple and inexpensive logic into modern NICs to accelerate performance on large-scale systems. 
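
The core idea in this abstract, expanding each double into a big integer so that sums become exact and order-independent, can be illustrated in software. The sketch below is not the NIC logic the paper proposes: the function names `to_fixed` and `exact_sum` and the choice of a single 2**-1074 fixed-point scale are assumptions made here for illustration only.

```python
import math

# Assumption for this sketch: use one fixed-point scale for all values.
# The smallest positive double is 2**-1074, so every finite double is an
# exact integer multiple of it.
SCALE_BITS = 1074


def to_fixed(x: float) -> int:
    """Return the exact integer x * 2**SCALE_BITS for a finite double x."""
    if not math.isfinite(x):
        raise ValueError("only finite doubles can be expanded")
    num, den = x.as_integer_ratio()  # exact; den is a power of two <= 2**1074
    return num * ((1 << SCALE_BITS) // den)


def exact_sum(values) -> float:
    """Sum doubles without intermediate rounding, then round once at the end."""
    acc = 0  # Python integers are arbitrary precision
    for v in values:
        acc += to_fixed(v)
    # int / int true division in CPython is correctly rounded
    return acc / (1 << SCALE_BITS)
```

For example, `exact_sum([1e16, 1.0, -1e16])` returns `1.0`, whereas the plain left-to-right double expression `(1e16 + 1.0) - 1e16` evaluates to `0.0` because the `1.0` is lost to rounding. Since integer addition is associative, the result does not depend on the order in which the values arrive, which is the reproducibility property the abstract emphasizes.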

Cy Chan, Didem Unat, Michael Lijewski, Weiqun Zhang, John Bell, John Shalf, "Software Design Space Exploration for Exascale Combustion Co-Design", International Supercomputing Conference (ISC), Leipzig, Germany, June 16, 2013,

Nan Jiang, Daniel U. Becker, George Michelogiannakis, James Balfour, Brian Towles, John Kim, William J. Dally, "A Detailed and Flexible Cycle-Accurate Network-on-Chip Simulator", International Symposium on Performance Analysis of Systems and Software, IEEE Computer Society, April 2013,

D Unat, CP Chan, W Zhang, J Bell, J Shalf, "Tiling as a Durable Abstraction for Parallelism and Data Locality", WOLFHPC 2013 - SC13 Workshop on Domain-Specific Languages and High-Level Frameworks for High-Performance Computing, 2013,

George Michelogiannakis, William J. Dally, "Elastic Buffer Flow Control for On-Chip Networks", IEEE Transactions on Computers, 2013,

Networks-on-chip (NoCs) were developed to meet the communication requirements of large-scale systems. The majority of current NoCs spend considerable area and power for router buffers. In our past work, we have developed elastic buffer (EB) flow control which adds simple control logic in the channels to use pipeline flip-flops (FFs) as EBs with two storage locations. This way, channels act as distributed FIFOs and input buffers are no longer required. Removing buffers and virtual channels (VCs) significantly simplifies router design. Compared to VC networks, EB networks provide an up to 45% shorter cycle time, 16% more throughput per unit power or 22% more throughput per unit area. EB networks provide traffic classes using duplicate physical subnetworks. However, this approach negates the cost gains or becomes infeasible for a large number of traffic classes. Therefore, in this paper we propose a hybrid EB-VC router which provides an arbitrary number of traffic classes by using an input buffer to drain flits facing severe contention or deadlock. Thus, hybrid routers operate as EB routers in the common case, and as VC routers when necessary. For this reason, the hybrid EB-VC scheme offers 21% more throughput per unit power than VC networks and 12% than EB networks.

Cy Chan, Joseph Kenny, Gilbert Hendry, Didem Unat, Vincent Beckner, John Bell, John Shalf, "An AMR Computation and Communication Dependency and Analysis Methodology", IA^3 2013 - SC13 Workshop on Irregular Applications: Architectures and Algorithms, Denver, CO, January 1, 2013,


Daniel U. Becker, Nan Jiang, George Michelogiannakis, William J. Dally, "Adaptive Backpressure: Efficient Buffer Management for On-Chip Networks", International Conference on Computer Design, IEEE Computer Society, 2012,

This paper introduces Adaptive Backpressure, a novel scheme that improves the utilization of dynamically managed router input buffers by continuously adjusting the stiffness of the flow control feedback loop in response to observed traffic conditions. Through a simple extension to the router’s flow control mechanism, the proposed scheme heuristically limits the number of credits available to individual virtual channels based on estimated downstream congestion, aiming to minimize the amount of buffer space that is occupied unproductively. This leads to more efficient distribution of buffer space and improves isolation between multiple concurrently executing workloads with differing performance characteristics.

Experimental results for a 64-node mesh network show that Adaptive Backpressure improves network stability, leading to an average 2.6× increase in throughput under heavy load across traffic patterns. In the presence of background traffic, the proposed scheme reduces zero-load latency by an average of 31%. Finally, it mitigates the performance degradation encountered when latency- and throughput-optimized execution cores contend for network resources in a heterogeneous chip multi-processor; across a set of PARSEC benchmarks, we observe an average reduction in execution time of 34%.

Jiaoyan Chen, Emanuel Popovici, Dilip Vasudevan, Michel Schellekens, "Ultra Low Power Booth Multiplier Using Asynchronous Logic", 2012 18th IEEE International Symposium on Asynchronous Circuits and Systems (ASYNC), Lyngby, Denmark, IEEE, July 19, 2012, 81 - 88, doi: 10.1109/ASYNC.2012.15

Asynchronous logic shows promising applicability in ASIC design due to its potentially low power and high robustness properties. For deep submicron technologies the static power is becoming very significant and many applications require this power component to be reduced. A new logic called Positive Feedback Charge Sharing Logic (PFCSL) is proposed, which reduces both dynamic and especially static power and can also be implemented with asynchronous logic. This new logic combines adiabatic logic with charge sharing technology, avoiding the penalty of a power clock generator. A novel 16-by-16-bit Radix-4 Booth Multiplier is built based on PFCSL and implemented in 45nm technology. We achieve around 30% reduction in dynamic power and 60% in static power respectively compared to the same design implemented using static dual-rail logic. Also, the area of the multiplier is significantly smaller.

Energy-Efficient Flow-Control for On-Chip Networks, George Michelogiannakis, Stanford University, 2012,

With the emergence of on-chip networks, the power consumed by router buffers has become a primary concern. Bufferless flow control has been proposed to address this issue by removing router buffers and handling contention by dropping or deflecting flits. In this thesis, we compare virtual-channel (buffered) and deflection (packet-switched bufferless) flow control. Our study shows that unless process constraints lead to excessively costly buffers, the performance, cost and increased complexity of deflection flow control outweigh its potential gains. To provide buffering in the network but without the cost and timing overhead of router buffers, we propose elastic buffer (EB) flow control which adds simple control logic in the channels to use pipeline flip-flops (FFs) as EBs with two storage locations. This way, channels act as distributed FIFOs and input buffers as well as the complexity for virtual channels (VCs) are no longer required. Therefore, EB networks have a shorter cycle time and offer more throughput per unit power than VC networks. We also propose a hybrid EB-VC router which is used to provide traffic separation for a number of traffic classes large enough for duplicate physical channels to be inefficient. These hybrid routers offer more throughput per unit power than both EB and VC routers. Finally, this thesis proposes packet chaining, which addresses the tradeoff between allocation quality and cycle time traditionally present in routers with VCs. Packet chaining is a simple and effective method to increase allocator matching efficiency to be comparable or superior to more complex and slower allocators without extending cycle time, particularly suited to networks with short packets.

Nan Jiang, Daniel U. Becker, George Michelogiannakis, William J. Dally, "Network Congestion Avoidance through Speculative Reservation", International Symposium on High Performance Computer Architecture, IEEE Computer Society, 2012,

Congestion caused by hot-spot traffic can significantly degrade the performance of a computer network. In this study, we present the Speculative Reservation Protocol (SRP), a new network congestion control mechanism that relieves the effect of hot-spot traffic in high bandwidth, low latency, lossless computer networks. Compared to existing congestion control approaches like Explicit Congestion Notification (ECN), which react to network congestion through packet marking and rate throttling, SRP takes a proactive approach of congestion avoidance. Using a light-weight endpoint reservation scheme and speculative packet transmission, SRP avoids hot-spot congestion while incurring minimal overhead. Our simulation results show that SRP responds more rapidly to the onset of severe hot-spots than ECN and has a higher network throughput on bursty network traffic. SRP also performs comparably to networks without congestion control on benign traffic patterns by reducing the latency and throughput overhead commonly associated with reservation protocols.


George Michelogiannakis, Packet Chaining: Efficient Single-Cycle Allocation for On-Chip networks, International Symposium on Microarchitecture, 2011,

George Michelogiannakis, Nan Jiang, Daniel U. Becker, William J. Dally, "Packet Chaining: Efficient Single-Cycle Allocation for On-Chip networks", International Symposium on Microarchitecture, ACM, 2011,

This paper introduces packet chaining, a simple and effective method to increase allocator matching efficiency and hence network performance, particularly suited to networks with short packets and short cycle times. Packet chaining operates by chaining packets destined to the same output together, to reuse the switch connection of a departing packet. This allows an allocator to build up an efficient matching over a number of cycles like incremental allocation, but not limited by packet length. For a 64-node 2D mesh at maximum injection rate and with single-flit packets, packet chaining increases network throughput by 15% compared to a highly-tuned router using a conventional single-iteration separable iSLIP allocator, and outperforms significantly more complex allocators. Specifically, it outperforms multiple-iteration iSLIP allocators and wavefront allocators by 10% and 6% respectively, and gives comparable throughput with an augmenting paths allocator. Packet chaining achieves this performance with a cycle time comparable to a single-iteration separable allocator. Packet chaining also reduces average network latency by 22.5% compared to a single-iteration iSLIP allocator. Finally, packet chaining increases IPC up to 46% (16% average) for application benchmarks because short packets are critical in a typical cache-coherent chip multiprocessor.

M. Wehner, L. Oliker, J. Shalf, D. Donofrio, L. Drummond, et al., "Hardware/Software Co-design of Global Cloud System Resolving Models", Journal of Advances in Modeling Earth Systems (JAMES), 2011, 3, M1000:22, doi: 10.1029/2011MS000073

Jens Krueger, David Donofrio, John Shalf, Marghoob Mohiyuddin, Samuel Williams, Leonid Oliker, Franz-Josef Pfreund, "Hardware/software co-design for energy-efficient seismic modeling", Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC), January 2011, 73, doi: 10.1145/2063384.2063482


David Donofrio, Leonid Oliker, John Shalf, Michael F. Wehner, Daniel Burke, John Wawrzynek, "Project Green Flash---Design and Emulate A Low-Power CPU for a New Climate-Modeling Supercomputer", Design Automation Conference (DAC47), 2010,

George Michelogiannakis, Daniel Sanchez, William J. Dally, Christos Kozyrakis, "Evaluating Bufferless Flow Control for On-chip Networks", International Symposium on Networks-on-Chip, IEEE Computer Society, 2010,

With the emergence of on-chip networks, the power consumed by router buffers has become a primary concern. Bufferless flow control addresses this issue by removing router buffers, and handles contention by dropping or deflecting flits. This work compares virtual-channel (buffered) and deflection (packet-switched bufferless) flow control. Our evaluation includes optimizations for both schemes: buffered networks use custom SRAM-based buffers and empty buffer bypassing for energy efficiency, while bufferless networks feature a novel routing scheme that reduces average latency by 5%. Results show that unless process constraints lead to excessively costly buffers, the performance, cost and increased complexity of deflection flow control outweigh its potential gains: bufferless designs are only marginally (up to 1.5%) more energy efficient at very light loads, and buffered networks provide lower latency and higher throughput per unit power under most conditions.

George Michelogiannakis, Evaluating Bufferless Flow Control for On-chip Networks, International Symposium on Networks-on-Chip, 2010,

Daniel Sanchez, George Michelogiannakis, Christos Kozyrakis, "An Analysis of Interconnection Networks for Large Scale Chip Multiprocessors", ACM Transactions on Architecture and Code Optimization, 2010,

With the number of cores of chip multiprocessors (CMPs) rapidly growing as technology scales down, connecting the different components of a CMP in a scalable and efficient way becomes increasingly challenging. In this article, we explore the architectural-level implications of interconnection network design for CMPs with up to 128 fine-grain multithreaded cores. We evaluate and compare different network topologies using accurate simulation of the full chip, including the memory hierarchy and interconnect, and using a diverse set of scientific and engineering workloads.

We find that the interconnect has a large impact on performance, as it is responsible for 60% to 75% of the miss latency. Latency, and not bandwidth, is the primary performance constraint, since, even with many threads per core and workloads with high miss rates, networks with enough bandwidth can be efficiently implemented for the system scales we consider. From the topologies we study, the flattened butterfly consistently outperforms the mesh and fat tree on all workloads, leading to performance advantages of up to 22%. We also show that considering interconnect and memory hierarchy together when designing large-scale CMPs is crucial, and neglecting either of the two can lead to incorrect conclusions. Finally, the effect of the interconnect on overall performance becomes more important as the number of cores increases, making interconnection choices especially critical when scaling up.

John Shalf, Donofrio, Rowen, Oliker, Michael F. Wehner, "Green Flash: Climate Machine (LBNL)", Encyclopedia of Parallel Computing, (Springer: 2010) Pages: 809-819

Green Flash is a research project focused on an application-driven manycore chip design that leverages commodity-embedded circuit designs and hardware/software codesign processes to create a highly programmable and energy-efficient HPC design. The project demonstrates how a multidisciplinary hardware/software codesign process that facilitates close interactions between applications scientists, computer scientists, and hardware engineers can be used to develop a system tailored for the requirements of scientific computing.


George Michelogiannakis, William J. Dally, "Router Designs for Elastic Buffer On-Chip Networks", Conference on High Performance Computing Networking, Storage and Analysis, ACM, 2009,

This paper explores the design space of elastic buffer (EB) routers by evaluating three representative designs. We propose an enhanced two-stage EB router which maximizes throughput by achieving a 42% reduction in cycle time and 20% reduction in occupied area by using look-ahead routing and replacing the three-slot output EBs in the baseline router of [17] with two-slot EBs. We also propose a single-stage router which merges the two pipeline stages to avoid pipelining overhead. This design reduces zero-load latency by 24% compared to the enhanced two-stage router if both are operated at the same clock frequency; moreover, the single-stage router reduces the required energy per transferred bit and occupied area by 29% and 30% respectively, compared to the enhanced two-stage router. However, the cycle time of the enhanced two-stage router is 26% smaller than that of the single-stage router.

George Michelogiannakis, James Balfour, William J. Dally, "Elastic Buffer Flow Control for On-Chip Networks", International Symposium on High Performance Computer Architecture, IEEE Computer Society, 2009,

This paper presents elastic buffers (EBs), an efficient flow-control scheme that uses the storage already present in pipelined channels in place of explicit input virtual-channel buffers (VCBs). With this approach, the channels themselves act as distributed FIFO buffers. Without VCBs, and hence virtual channels (VCs), deadlock prevention is achieved by duplicating physical channels. We develop a channel occupancy detector to apply universal globally adaptive load-balancing (UGAL) routing to load balance traffic in networks using EBs. Using EBs results in up to 8% (12% for low-swing channels) improvement in peak throughput per unit power compared to a VC flow-control network. These gains allow for a wider network datapath to be used to offset the removal of VCBs and increase throughput for a fixed power budget. EB networks have identical zero-load latency to VC networks operating under the same frequency. The microarchitecture of an EB router is considerably simpler than a VC router because allocators and credits are not required. For 5×5 mesh routers, this results in an 18% improvement in the cycle time.

George Michelogiannakis, Elastic Buffer Flow Control for On-Chip Networks, International Symposium on High Performance Computer Architecture, 2009,

G. Hendry, S.A. Kamil, A. Biberman, J. Chan, B.G. Lee, M. Mohiyuddin, A. Jain, K. Bergman, L.P. Carloni, J. Kubiatowicz, L. Oliker, J. Shalf, "Analysis of Photonic Networks for Chip Multiprocessor Using Scientific Applications", International Symposium on Networks-on-Chip (NOCS), 2009,

J Gebis, L Oliker, J Shalf, S Williams, K Yelick, "Improving memory subsystem performance using ViVA: Virtual vector architecture", Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2009, 5455 LNC:146--158, doi: 10.1007/978-3-642-00454-4_16

Marghoob Mohiyuddin, Murphy, Oliker, Shalf, Wawrzynek, Samuel Williams, "A design methodology for domain-optimized power-efficient supercomputing", Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC), 2009, doi: 10.1145/1654059.1654072

S. Kamil, L. Oliker, A. Pinar, J. Shalf, "Communication Requirements and Interconnect Optimization for High-End Scientific Applications", IEEE Transactions on Parallel and Distributed Systems (TPDS), 2009,

John Shalf and Jason Hick (Arie Shoshani and Doron Rotem), "Storage Technology Fundamentals", Scientific Data Management: Challenges, Technology, and Deployment, Volume . Chapman & Hall/CRC, 2009,

S. Amarasinghe, D. Campbell, W. Carlson, A. Chien, W. Dally, E. Elnozahy, M. Hall, R. Harrison, W. Harrod, K. Hill, J. Hiller, S. Karp, C. Koelbel, D. Koester, P. Kogge, J. Levesque, D. Reed, V. Sarkar, R. Schreiber, M. Richards, A. Scarpelli, J. Shalf, A. Snavely, T. Sterling, "ExaScale Software Study: Software Challenges in Extreme Scale Systems", 2009,

John Shalf, Thomas Sterling, "Operating Systems For Exascale Computing", 2009,

John Shalf, Erik Schnetter, Gabrielle Allen, Edward Seidel, Cactus and the Role of Frameworks in Complex Multiphysics HPC Applications, 2009,

John Shalf, Auto-Tuning: The Big Questions (Panel), 2009,

John Shalf, David Donofrio, Green Flash: Extreme Scale Computing on a Petascale Budget, 2009,

John Shalf, Challenges of Energy Efficient Scientific Computing, 2009,

John Shalf, Harvey Wasserman, Breakthrough Computing in Petascale Applications and Petascale System Examples at NERSC, 2009,

John Shalf, Satoshi Matsuoka, IESP Power Efficiency Research Priorities, 2009,

Brian van Straalen, Shalf, J. Ligocki, Keen, Woo-Sun Yang, "Scalability challenges for massively parallel AMR applications", IPDPS, 2009, 1-12,

David Donofrio, Oliker, Shalf, F. Wehner, Rowen, Krueger, Kamil, Marghoob Mohiyuddin, "Energy-Efficient Computing for Extreme-Scale Science", IEEE Computer, January 2009, 42:62-71, doi: 10.1109/MC.2009.35




Cy Chan, Shoaib Kamil, John Shalf, Generalized Multicore Autotuning for Stencil-based PDE Solvers, Lawrence Berkeley National Laboratory, August 21, 2008,

Shoaib Kamil, Shalf, Erich Strohmaier, "Power efficiency in high performance computing", IPDPS, 2008, 1-8,

Hongzhang Shan, Antypas, John Shalf, "Characterizing and predicting the I/O performance of HPC applications using a parameterized synthetic benchmark", SC, 2008, 42,

Shantenu Jha, Hartmut Kaiser, Andre Merzky, John Shalf, "SAGA - The Simple API for Grid Applications - Motivation, Design, and Implementation", Encyclopedia of Grid Technologies and Applications, Volume 1. Information Science Reference, 2008,

Antypas, K., Shalf, J., and Wasserman, H., "NERSC-6 Workload Analysis and Benchmark Selection Process", 2008, LBNL 1014E,

John Shalf, NERSC User IO Cases, 2008,

Antypas, K., Shalf, J., and Wasserman, H., Recent Workload Characterization Activities at NERSC, 2008,

John Shalf, Neuroinformatics Congress: Future Hardware Challenges for Scientific Computing, 2008,

M. Wehner, L. Oliker, J. Shalf, Ultra-Efficient Exascale Scientific Computing, 2008,


Vassilis Papaefstathiou, Dionisios Pnevmatikatos, Manolis Marazakis, Giorgos Kalokairinos, Aggelos Ioannou, Michael Papamichael, Stamatis Kavadias, George Michelogiannakis, Manolis Katevenis, "Prototyping Efficient Interprocessor Communication Mechanisms", International Conference on Embedded Computer Systems: Architectures, Modelling and Simulations, IEEE Computer Society, 2007,

Parallel computing systems are becoming widespread and grow in sophistication. Besides simulation, rapid system prototyping becomes important in designing and evaluating their architecture. We present an efficient FPGA-based platform that we developed and use for research and experimentation on high speed interprocessor communication, network interfaces and interconnects. Our platform supports advanced communication capabilities such as Remote DMA, Remote Queues, zero-copy data delivery and flexible notification mechanisms, as well as link bundling for increased performance. We report on the platform architecture, its design cost, complexity and performance (latency and throughput). We also report our experiences from implementing benchmarking kernels and a user-level benchmark application, and show how software can take advantage of the provided features, but also expose the weaknesses of the system.

Approaching Ideal NoC Latency with Pre-Configured Routes, George Michelogiannakis, University of Crete, 2007,

George Michelogiannakis, Dionisios Pnevmatikatos, Manolis Katevenis, "Approaching Ideal NoC Latency with Pre-Configured Routes", First International Symposium on Networks-on-Chip, IEEE Computer Society, 2007,

In multi-core ASICs, processors and other compute engines need to communicate with memory blocks and other cores with latency as close as possible to the ideal of a direct buffered wire. However, current state of the art networks-on-chip (NoCs) suffer, at best, latency of one clock cycle per hop. We investigate the design of a NoC that offers close to the ideal latency in some preferred, run-time configurable paths. Processors and other compute engines may perform network reconfiguration to guarantee low latency over different sets of paths as needed. Flits in non-preferred paths are given lower priority than flits in preferred ones, and suffer a delay of one clock cycle per hop when there is no contention. To achieve our goal, we use the "mad-postman" [5] technique: every incoming flit is eagerly (i.e. speculatively) forwarded to the input's preferred output, if any. This is accomplished with the mere delay of a single pre-enabled tri-state driver. We later check if that decision was correct, and if not, we forward the flit to the proper output. Incorrectly forwarded flits are classified as dead and eliminated in later hops. We use a 2D mesh topology tailored for processor-memory communication, and a modified version of XY routing that remains deadlock-free. Our evaluation shows that, for the preferred paths, our approach offers typical latency around 500 ps versus 1500 ps for a full clock cycle or 135 ps for an ideal direct connect, in a 130 nm technology; non-preferred paths suffer a one clock cycle delay per hop, similar to that of other approaches. Performance gains are significant and can be proven greatly useful in other application domains as well.

John Shalf, NERSC Workload Analysis, 2007,

John Shalf, NERSC Power Efficiency Analysis, 2007,

Shoaib Kamil, John Shalf, Power Efficiency Metrics for the Top500, 2007,

John Shalf, Petascale Computing Application Challenges, 2007,

John Shalf, Power, Cooling, and Energy Consumption for the Petascale and Beyond, 2007,

John Shalf, Shoaib Kamil, David Bailey, Erich Strohmaier, Power Efficiency and the Top500, 2007,

Shoaib Kamil, Pinar, Gunter, Lijewski, Leonid Oliker, John Shalf, "Reconfigurable hybrid interconnection for static and dynamic scientific applications", Conf. Computing Frontiers, 2007, 183-194, LBNL 60060,

John Shalf, Hongzhang Shan, User Perspective on HPC I/O Requirements, 2007,

John Shalf, Shoaib Kamil, David Skinner, Leonid Oliker, Interconnect Requirements for HPC Applications, 2007,

John Shalf, Overturning the Conventional Wisdom for the Multicore Era: Everything You Know is Wrong, 2007,

John Shalf, The Landscape of Parallel Computing Architecture, 2007,

John Shalf, About Memory Bandwidth and Multicore, 2007,

John Shalf, Landscape of Computing Architecture: Introduction to the "Berkeley View", 2007,

John Shalf, Memory Subsystem Performance and QuadCore Predictions, 2007,

Shoaib Kamil, John Shalf, "Measuring Power Efficiency of NERSC's Newest Flagship Machine", 2007,

John Shalf, "The New Landscape of Parallel Computer Architecture", Journal of Physics: Conference Series, Volume . IOP Electronics Journals, 2007,

Hongzhang Shan and John Shalf, "Using IOR to Analyze the I/O performance for HPC Platforms", 2007, LBNL 62647,


Tom Goodale, Shantenu Jha, Hartmut Kaiser, Thilo Kielmann, Pascal Kleijer, Gregor von Laszewski, Craig Lee, Andre Merzky, Hrabri Rajic, John Shalf, "SAGA: A Simple API for Grid Applications -- High-Level Application Programming on the Grid", Computational Methods in Science and Technology, Volume 12(1). Poznan, 2006, LBNL 59066,

John Shalf, David Bailey, Top500 Power Efficiency, 2006,

Hongzhang Shan, John Shalf, "Analysis of Parallel IO on Modern HPC Platforms", 2006,

  • Download File: IOR.doc (doc: 399 KB)

Analysis of the parallel IO requirements from a number of HPC applications, combined with microbenchmarks to aid in understanding their performance.


IPIF to PCI bridge specification, George Michelogiannakis, University of Crete, 2005,

John Shalf, Shoaib Kamil, Leonid Oliker, David Skinner, "Analyzing Ultra-Scale Application Communication Requirements for a Reconfigurable Hybrid Interconnect", Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC), 2005, 17,

Horst Simon, William Kramer, William Saphir, John Shalf, David Bailey, Leonid Oliker, Michael Banda, C. William McCurdy, John Hules, Andrew Canning, Marc Day, Philip Colella, David Serafini, Michael Wehner, Peter Nugent, "Science-Driven System Architecture: A New Process for Leadership Class Computing", Journal of the Earth Simulator, Volume 2., 2005, LBNL 56545,