George Michelogiannakis

Research Scientist
Computational Research Division
Phone: 510-495-2011
Lawrence Berkeley National Laboratory
One Cyclotron Rd
Berkeley, CA 94720 US

Biographical Sketch

George Michelogiannakis is a research scientist in the Computer Architecture Group (CAG) in the Computational Research Division (CRD). He has worked extensively on networking (both off- and on-chip) and computer architecture. His latest work focuses on the post-Moore's law era, looking into specialization, emerging devices (transistors), memories, photonics, and 3D integration. He is also currently working on optics and architecture for HPC and datacenter networks.

Journal Articles

George Michelogiannakis, Xiaoye S. Li, David H. Bailey, John Shalf, "Extending Summation Precision for Network Reduction Operations", Springer International Journal of Parallel Programming, December 2015, 43:6:1218-1243, doi: 10.1007/s10766-014-0326-5

George Michelogiannakis, William J. Dally, "Elastic Buffer Flow Control for On-Chip Networks", Transactions on Computers, 2013,

Networks-on-chip (NoCs) were developed to meet the communication requirements of large-scale systems. The majority of current NoCs spend considerable area and power for router buffers. In our past work, we have developed elastic buffer (EB) flow control which adds simple control logic in the channels to use pipeline flip-flops (FFs) as EBs with two storage locations. This way, channels act as distributed FIFOs and input buffers are no longer required. Removing buffers and virtual channels (VCs) significantly simplifies router design. Compared to VC networks, EB networks provide an up to 45% shorter cycle time, 16% more throughput per unit power or 22% more throughput per unit area. EB networks provide traffic classes using duplicate physical subnetworks. However, this approach negates the cost gains or becomes infeasible for a large number of traffic classes. Therefore, in this paper we propose a hybrid EB-VC router which provides an arbitrary number of traffic classes by using an input buffer to drain flits facing severe contention or deadlock. Thus, hybrid routers operate as EB routers in the common case, and as VC routers when necessary. For this reason, the hybrid EB-VC scheme offers 21% more throughput per unit power than VC networks and 12% than EB networks.

George Michelogiannakis, Nan Jiang, Daniel U. Becker, William J. Dally, "Packet Chaining: Efficient Single-Cycle Allocation for On-Chip Networks", Computer Architecture Letters, July 1, 2011,

This paper introduces packet chaining, a simple and effective method to increase allocator matching efficiency and hence network performance, particularly suited to networks with short packets and short cycle times. Packet chaining operates by chaining packets destined to the same output together, to reuse the switch connection of a departing packet. This allows an allocator to build up an efficient matching over a number of cycles, like incremental allocation, but not limited by packet length. For a 64-node 2D mesh at maximum injection rate and with single-flit packets, packet chaining increases network throughput by 15% compared to a conventional single-iteration separable iSLIP allocator, outperforms a wavefront allocator, and gives comparable throughput with an augmenting paths allocator. Packet chaining achieves this performance with a cycle time comparable to a single-iteration separable allocator. Packet chaining also reduces average network latency by 22.5%. Finally, packet chaining increases IPC up to 46% (16% average) for application benchmarks because short packets are critical in a typical cache-coherent CMP. These are considerable improvements given the maturity of network-on-chip routers and allocators.

George Michelogiannakis, Daniel U. Becker, William J. Dally, "Evaluating Elastic Buffer and Wormhole Flow Control", Transactions on Computers, 2011,

With the emergence of on-chip networks, router buffer power has become a primary concern. Elastic buffer (EB) flow control utilizes existing pipeline flip-flops in the channels to implement distributed FIFOs, eliminating the need for input buffers at the routers. EB routers have been shown to be more efficient than virtual channel routers, as they do not require input buffers or complex logic for managing virtual channels and tracking credits. Wormhole routers are more comparable in terms of complexity because they also lack virtual channels. This paper compares EB and wormhole routers and explores novel hybrid designs to more closely examine the effect of design simplicity and input buffer cost. Our results show that EB routers have up to 25 percent smaller cycle time compared to wormhole and hybrid routers. Moreover, EB flow control requires 10 percent less energy to transfer a single bit through a router and offers three percent more throughput per unit energy as well as 62 percent more throughput per unit area. The main contributor to these results is the cost and delay overhead of the input buffer.

Daniel Sanchez, George Michelogiannakis, Christos Kozyrakis, "An Analysis of Interconnection Networks for Large Scale Chip Multiprocessors", Transactions on Architecture and Code Optimization, 2010,

With the number of cores of chip multiprocessors (CMPs) rapidly growing as technology scales down, connecting the different components of a CMP in a scalable and efficient way becomes increasingly challenging. In this article, we explore the architectural-level implications of interconnection network design for CMPs with up to 128 fine-grain multithreaded cores. We evaluate and compare different network topologies using accurate simulation of the full chip, including the memory hierarchy and interconnect, and using a diverse set of scientific and engineering workloads.

We find that the interconnect has a large impact on performance, as it is responsible for 60% to 75% of the miss latency. Latency, and not bandwidth, is the primary performance constraint, since, even with many threads per core and workloads with high miss rates, networks with enough bandwidth can be efficiently implemented for the system scales we consider. From the topologies we study, the flattened butterfly consistently outperforms the mesh and fat tree on all workloads, leading to performance advantages of up to 22%. We also show that considering interconnect and memory hierarchy together when designing large-scale CMPs is crucial, and neglecting either of the two can lead to incorrect conclusions. Finally, the effect of the interconnect on overall performance becomes more important as the number of cores increases, making interconnection choices especially critical when scaling up.

Conference Papers

George Michelogiannakis, John Shalf, "Last Level Collective Hardware Prefetching For Data-Parallel Applications", IEEE 24th International Conference on High Performance Computing, IEEE, December 2017,

Dilip Vasudevan, George Michelogiannakis, John Shalf, "CASPER - Configurable Design Space Exploration of Programmable Architectures for Machine Learning using Beyond Moore Devices", IEEE/ACM International Symposium on Nanoscale Architectures (NANOARCH), July 2017,

Dilip Vasudevan, Anastasiia Butko, George Michelogiannakis, David Donofrio, John Shalf, "Towards an Integrated Strategy to Preserve Digital Computing Performance Scaling Using Emerging Technologies", Workshop on HPC computing in a Post Moore’s law world (HCPM), June 22, 2017,

With the decline and eventual end of historical rates of lithographic scaling, we arrive at a crossroad where synergistic and holistic decisions are required to preserve Moore's law technology scaling. Numerous emerging technologies aim to extend digital electronics scaling of performance, energy efficiency, and computational power/density, ranging from devices (transistors), memories, 3D integration capabilities, specialized architectures, photonics, and others. The wide range of technology options creates the need for an integrated strategy to understand the impact of these emerging technologies on future large-scale digital systems for diverse application requirements and optimization metrics. In this paper, we argue for a comprehensive methodology that spans the different levels of abstraction, from materials, to devices, to complex digital systems and applications. Our approach integrates compact models of low-level characteristics of the emerging technologies to inform higher-level simulation models to evaluate their responsiveness to application requirements. The integrated framework can then automate the search for an optimal architecture using available emerging technologies to maximize a targeted optimization metric.

George Michelogiannakis, Khaled Z. Ibrahim, John Shalf, Jeremiah J. Wilke, Samuel Knight, Joseph P. Kenny, "APHiD: Hierarchical Task Placement to Enable a Tapered Fat Tree Topology for Lower Power and Cost in HPC Networks", 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, IEEE, May 2017, LBNL 1007126,

George Michelogiannakis, Dave Donofrio, John Shalf, "Modeling of Novel Transistors, Manufacturing Technologies, and Architectures to Preserve Digital Computing Performance Scaling", 1st International Workshop on Post-Moore’s Era Supercomputing (PMES), November 2016,

Didem Unat, Tan Nguyen, Weiqun Zhang, Muhammed Nufail Farooqi, Burak Bastem, George Michelogiannakis, Ann Almgren, John Shalf, "TiDA: High-Level Programming Abstractions for Data Locality Management", ISC 2016: International Supercomputing conference, June 2016,

Farzad Fatollahi-Fard, David Donofrio, George Michelogiannakis, John Shalf, "OpenSoC Fabric: On-Chip Network Generator", ISPASS 2016: International Symposium on Performance Analysis of Systems and Software, IEEE, April 2016,

Farzad Fatollahi-Fard, David Donofrio, George Michelogiannakis, John Shalf, "OpenSoC Fabric: On-Chip Network Generator", Proceedings of the Workshop on Network on Chip Architectures, ACM, December 2014, 45-50, LBNL LBNL-1005675, doi: 10.1145/2685342.2685351

George Michelogiannakis, John Shalf, "Variable-Width Datapath for On-Chip Network Static Power Reduction", 8th International Symposium on Networks-on-Chip (NOCS), September 2014,

George Michelogiannakis, Alexander Williams, Samuel Williams, John Shalf, "Collective Memory Transfers for Multi-Core Chips", International Conference on Supercomputing (ICS), June 2014, doi: 10.1145/2597652.2597654

George Michelogiannakis, Nan Jiang, Daniel U. Becker, William J. Dally, "Channel Reservation Protocol for Over-Subscribed Channels and Destinations", Conference on High Performance Computing Networking, Storage and Analysis, ACM, 2013,

George Michelogiannakis, Xiaoye S. Li, David H. Bailey, John Shalf, "Extending Summation Precision for Network Reduction Operations", 25th International Symposium on Computer Architecture and High Performance Computing, IEEE Computer Society, October 2013,

Double precision summation is at the core of numerous important algorithms such as Newton-Krylov methods and other operations involving inner products, but the effectiveness of summation is limited by the accumulation of rounding errors, which are an increasing problem with the scaling of modern HPC systems and data sets. To reduce the impact of precision loss, researchers have proposed increased- and arbitrary-precision libraries that provide reproducible error or even bounded error accumulation for large sums, but do not guarantee an exact result. Such libraries can also increase computation time significantly. We propose big integer (BigInt) expansions of double precision variables that enable arbitrarily large summations without error and provide exact and reproducible results. This is feasible with performance comparable to that of double-precision floating point summation, by the inclusion of simple and inexpensive logic into modern NICs to accelerate performance on large-scale systems. 
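The idea of order-independent, exact summation through integer expansion can be illustrated in software. The following is only a minimal sketch, assuming Python's arbitrary-precision integers stand in for the BigInt expansion; the names `to_fixed` and `exact_sum` are hypothetical, and the sketch does not model the NIC logic or the data layout used in the paper.

```python
from fractions import Fraction

# The smallest positive double is 2**-1074, so every finite double is an
# exact integer multiple of 2**-1074.
SCALE_BITS = 1074


def to_fixed(x: float) -> int:
    """Return x * 2**1074 as an exact integer (hypothetical helper).

    Raises ValueError or OverflowError for NaN and infinities, which
    have no integer expansion.
    """
    num, den = x.as_integer_ratio()  # den is always a power of two
    return num << (SCALE_BITS - (den.bit_length() - 1))


def exact_sum(values) -> float:
    """Sum doubles without intermediate rounding (hypothetical helper)."""
    total = sum(to_fixed(v) for v in values)  # integer addition is exact
    # Round once, at the very end, back to the nearest double.
    return float(Fraction(total, 1 << SCALE_BITS))
```

Because integer addition is associative, `exact_sum` returns the same result for any ordering of its inputs. For example, `exact_sum([1e16, 1.0, -1e16])` returns `1.0`, whereas the built-in `sum` over the same list returns `0.0` because `1e16 + 1.0` rounds back to `1e16`.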

Nan Jiang, Daniel U. Becker, George Michelogiannakis, James Balfour, Brian Towles, John Kim, William J. Dally, "A Detailed and Flexible Cycle-Accurate Network-on-Chip Simulator", International Symposium on Performance Analysis of Systems and Software, IEEE Computer Society, April 2013,

Daniel U. Becker, Nan Jiang, George Michelogiannakis, William J. Dally, "Adaptive Backpressure: Efficient Buffer Management for On-Chip Networks", International Conference on Computer Design, IEEE Computer Society, 2012,

This paper introduces Adaptive Backpressure, a novel scheme that improves the utilization of dynamically managed router input buffers by continuously adjusting the stiffness of the flow control feedback loop in response to observed traffic conditions. Through a simple extension to the router’s flow control mechanism, the proposed scheme heuristically limits the number of credits available to individual virtual channels based on estimated downstream congestion, aiming to minimize the amount of buffer space that is occupied unproductively. This leads to more efficient distribution of buffer space and improves isolation between multiple concurrently executing workloads with differing performance characteristics.

Experimental results for a 64-node mesh network show that Adaptive Backpressure improves network stability, leading to an average 2.6× increase in throughput under heavy load across traffic patterns. In the presence of background traffic, the proposed scheme reduces zero-load latency by an average of 31%. Finally, it mitigates the performance degradation encountered when latency- and throughput-optimized execution cores contend for network resources in a heterogeneous chip multi-processor; across a set of PARSEC benchmarks, we observe an average reduction in execution time of 34%.

Nan Jiang, Daniel U. Becker, George Michelogiannakis, William J. Dally, "Network Congestion Avoidance through Speculative Reservation", International Symposium on High Performance Computer Architecture, IEEE Computer Society, 2012,

Congestion caused by hot-spot traffic can significantly degrade the performance of a computer network. In this study, we present the Speculative Reservation Protocol (SRP), a new network congestion control mechanism that relieves the effect of hot-spot traffic in high bandwidth, low latency, lossless computer networks. Compared to existing congestion control approaches like Explicit Congestion Notification (ECN), which react to network congestion through packet marking and rate throttling, SRP takes a proactive approach of congestion avoidance. Using a light-weight endpoint reservation scheme and speculative packet transmission, SRP avoids hot-spot congestion while incurring minimal overhead. Our simulation results show that SRP responds more rapidly to the onset of severe hot-spots than ECN and has a higher network throughput on bursty network traffic. SRP also performs comparably to networks without congestion control on benign traffic patterns by reducing the latency and throughput overhead commonly associated with reservation protocols.

George Michelogiannakis, Nan Jiang, Daniel U. Becker, William J. Dally, "Packet Chaining: Efficient Single-Cycle Allocation for On-Chip networks", International Symposium on Microarchitecture, ACM, 2011,

This paper introduces packet chaining, a simple and effective method to increase allocator matching efficiency and hence network performance, particularly suited to networks with short packets and short cycle times. Packet chaining operates by chaining packets destined to the same output together, to reuse the switch connection of a departing packet. This allows an allocator to build up an efficient matching over a number of cycles like incremental allocation, but not limited by packet length. For a 64-node 2D mesh at maximum injection rate and with single-flit packets, packet chaining increases network throughput by 15% compared to a highly-tuned router using a conventional single-iteration separable iSLIP allocator, and outperforms significantly more complex allocators. Specifically, it outperforms multiple-iteration iSLIP allocators and wavefront allocators by 10% and 6% respectively, and gives comparable throughput with an augmenting paths allocator. Packet chaining achieves this performance with a cycle time comparable to a single-iteration separable allocator. Packet chaining also reduces average network latency by 22.5% compared to a single-iteration iSLIP allocator. Finally, packet chaining increases IPC up to 46% (16% average) for application benchmarks because short packets are critical in a typical cache-coherent chip multiprocessor.

George Michelogiannakis, Daniel Sanchez, William J. Dally, Christos Kozyrakis, "Evaluating Bufferless Flow Control for On-chip Networks", International Symposium on Networks-on-Chip, IEEE Computer Society, 2010,

With the emergence of on-chip networks, the power consumed by router buffers has become a primary concern. Bufferless flow control addresses this issue by removing router buffers, and handles contention by dropping or deflecting flits. This work compares virtual-channel (buffered) and deflection (packet-switched bufferless) flow control. Our evaluation includes optimizations for both schemes: buffered networks use custom SRAM-based buffers and empty buffer bypassing for energy efficiency, while bufferless networks feature a novel routing scheme that reduces average latency by 5%. Results show that unless process constraints lead to excessively costly buffers, the performance, cost and increased complexity of deflection flow control outweigh its potential gains: bufferless designs are only marginally (up to 1.5%) more energy efficient at very light loads, and buffered networks provide lower latency and higher throughput per unit power under most conditions.

George Michelogiannakis, William J. Dally, "Router Designs for Elastic Buffer On-Chip Networks", Conference on High Performance Computing Networking, Storage and Analysis, ACM, 2009,

This paper explores the design space of elastic buffer (EB) routers by evaluating three representative designs. We propose an enhanced two-stage EB router which maximizes throughput by achieving a 42% reduction in cycle time and 20% reduction in occupied area by using look-ahead routing and replacing the three-slot output EBs in the baseline router of [17] with two-slot EBs. We also propose a single-stage router which merges the two pipeline stages to avoid pipelining overhead. This design reduces zero-load latency by 24% compared to the enhanced two-stage router if both are operated at the same clock frequency; moreover, the single-stage router reduces the required energy per transferred bit and occupied area by 29% and 30% respectively, compared to the enhanced two-stage router. However, the cycle time of the enhanced two-stage router is 26% smaller than that of the single-stage router.

George Michelogiannakis, James Balfour, William J. Dally, "Elastic Buffer Flow Control for On-Chip Networks", International Symposium on High Performance Computer Architecture, IEEE Computer Society, 2009,

This paper presents elastic buffers (EBs), an efficient flow-control scheme that uses the storage already present in pipelined channels in place of explicit input virtual-channel buffers (VCBs). With this approach, the channels themselves act as distributed FIFO buffers. Without VCBs, and hence virtual channels (VCs), deadlock prevention is achieved by duplicating physical channels. We develop a channel occupancy detector to apply universal globally adaptive load-balancing (UGAL) routing to load balance traffic in networks using EBs. Using EBs results in up to 8% (12% for low-swing channels) improvement in peak throughput per unit power compared to a VC flow-control network. These gains allow for a wider network datapath to be used to offset the removal of VCBs and increase throughput for a fixed power budget. EB networks have identical zero-load latency to VC networks operating under the same frequency. The microarchitecture of an EB router is considerably simpler than a VC router because allocators and credits are not required. For 5×5 mesh routers, this results in an 18% improvement in the cycle time.

Vassilis Papaefstathiou, Dionisios Pnevmatikatos, Manolis Marazakis, Giorgos Kalokairinos, Aggelos Ioannou, Michael Papamichael, Stamatis Kavadias, George Michelogiannakis, Manolis Katevenis, "Prototyping Efficient Interprocessor Communication Mechanics", International Conference on Embedded Computer Systems: Architectures, Modelling and Simulations, IEEE Computer Society, 2007,

Parallel computing systems are becoming widespread and grow in sophistication. Besides simulation, rapid system prototyping becomes important in designing and evaluating their architecture. We present an efficient FPGA-based platform that we developed and use for research and experimentation on high speed interprocessor communication, network interfaces and interconnects. Our platform supports advanced communication capabilities such as Remote DMA, Remote Queues, zero-copy data delivery and flexible notification mechanisms, as well as link bundling for increased performance. We report on the platform architecture, its design cost, complexity and performance (latency and throughput). We also report our experiences from implementing benchmarking kernels and a user-level benchmark application, and show how software can take advantage of the provided features, but also expose the weaknesses of the system.

George Michelogiannakis, Dionisios Pnevmatikatos, Manolis Katevenis, "Approaching Ideal NoC Latency with Pre-Configured Routes", First International Symposium on Networks-on-Chip, IEEE Computer Society, 2007,

In multi-core ASICs, processors and other compute engines need to communicate with memory blocks and other cores with latency as close as possible to the ideal of a direct buffered wire. However, current state of the art networks-on-chip (NoCs) suffer, at best, latency of one clock cycle per hop. We investigate the design of a NoC that offers close to the ideal latency in some preferred, run-time configurable paths. Processors and other compute engines may perform network reconfiguration to guarantee low latency over different sets of paths as needed. Flits in non-preferred paths are given lower priority than flits in preferred ones, and suffer a delay of one clock cycle per hop when there is no contention. To achieve our goal, we use the "madpostman" [5] technique: every incoming flit is eagerly (i.e. speculatively) forwarded to the input's preferred output, if any. This is accomplished with the mere delay of a single pre-enabled tri-state driver. We later check if that decision was correct, and if not, we forward the flit to the proper output. Incorrectly forwarded flits are classified as dead and eliminated in later hops. We use a 2D mesh topology tailored for processor-memory communication, and a modified version of XY routing that remains deadlock-free. Our evaluation shows that, for the preferred paths, our approach offers typical latency around 500 ps versus 1500 ps for a full clock cycle or 135 ps for an ideal direct connect, in a 130 nm technology; non-preferred paths suffer a one clock cycle delay per hop, similar to that of other approaches. Performance gains are significant and can be proven greatly useful in other application domains as well.

Presentation/Talks

George Michelogiannakis, David Donofrio, John Shalf, Modeling of Novel Transistors, Manufacturing Technologies, and Architectures to Preserve Digital Computing Performance Scaling, Post-Moore's Era Supercomputing (PMES) Workshop, November 2016,

George Michelogiannakis, John Shalf, Variable-Width Datapath for On-Chip Network Static Power Reduction, 8th International Symposium on Networks-on-Chip, September 2014,

Didem Unat, George Michelogiannakis, John Shalf, The Role of Modeling in Locality Optimizations, Modeling and simulation workshop (MODSIM), August 2014,

George Michelogiannakis, Collective Memory Transfers for Multi-Core Chips, International Conference on Supercomputing (ICS), June 2014,

George Michelogiannakis, Channel Reservation Protocol for Over-Subscribed Channels and Destinations, Conference on High Performance Computing Networking, Storage and Analysis, 2013,

George Michelogiannakis, Hardware Support for Collective Memory Transfers in Stencil Computations, Workshop on Optimizing Stencil Computations, October 2013,

George Michelogiannakis, Extending Summation Precision for Distributed Network Operations, 25th International Symposium on Computer Architecture and High Performance Computing, October 2013,

George Michelogiannakis, Packet Chaining: Efficient Single-Cycle Allocation for On-Chip networks, International Symposium on Microarchitecture, 2011,

George Michelogiannakis, Evaluating Bufferless Flow Control for On-chip Networks, International Symposium on Networks-on-Chip, 2010,

George Michelogiannakis, Router Designs for Elastic Buffer On-Chip Networks, Conference on High Performance Computing Networking, Storage and Analysis, 2009,

George Michelogiannakis, Elastic Buffer Flow Control for On-Chip Networks, International Symposium on High Performance Computer Architecture, 2009,

George Michelogiannakis, Approaching Ideal NoC Latency with Pre-Configured Routes, International Symposium on Networks-on-Chip, 2007,

Reports

George Michelogiannakis, John Shalf, David Donofrio, John Bachan, "Continuing the Scaling of Digital Computing Post Moore’s Law", LBNL report, April 2016, LBNL 1005126,

Traditional CMOS technology scaling, which up until now has followed Moore's law, is expected to come to an end in the next decade. However, the DOE has come to depend on the rapid, predictable, and cheap scaling of computing performance to meet mission needs for scientific theory, large scale experiments, and national security. Moving forward, performance scaling of digital computing will need to originate from energy and cost reductions that are a result of novel architectures, devices, manufacturing technologies, and programming models. The deeper issue presented by these changes is the threat to DOE’s mission and to the future economic growth of the U.S. computing industry and to society as a whole. With the impending end of Moore’s law, it is imperative for the Office of Advanced Scientific Computing Research (ASCR) to develop a balanced research agenda to assess the viability of novel semiconductor technologies and navigate the ensuing challenges. This report identifies four areas and research directions for ASCR and how each can be used to preserve performance scaling of digital computing beyond exascale and after Moore's law ends.

Thesis/Dissertations

Energy-Efficient Flow-Control for On-Chip Networks, George Michelogiannakis, Stanford University, 2012,

With the emergence of on-chip networks, the power consumed by router buffers has become a primary concern. Bufferless flow control has been proposed to address this issue by removing router buffers and handling contention by dropping or deflecting flits. In this thesis, we compare virtual-channel (buffered) and deflection (packet-switched bufferless) flow control. Our study shows that unless process constraints lead to excessively costly buffers, the performance, cost and increased complexity of deflection flow control outweigh its potential gains. To provide buffering in the network but without the cost and timing overhead of router buffers, we propose elastic buffer (EB) flow control which adds simple control logic in the channels to use pipeline flip-flops (FFs) as EBs with two storage locations. This way, channels act as distributed FIFOs and input buffers as well as the complexity for virtual channels (VCs) are no longer required. Therefore, EB networks have a shorter cycle time and offer more throughput per unit power than VC networks. We also propose a hybrid EB-VC router which is used to provide traffic separation for a number of traffic classes large enough for duplicate physical channels to be inefficient. These hybrid routers offer more throughput per unit power than both EB and VC routers. Finally, this thesis proposes packet chaining, which addresses the tradeoff between allocation quality and cycle time traditionally present in routers with VCs. Packet chaining is a simple and effective method to increase allocator matching efficiency to be comparable or superior to more complex and slower allocators without extending cycle time, particularly suited to networks with short packets.

Approaching Ideal NoC Latency with Pre-Configured Routes, George Michelogiannakis, University of Crete, 2007,

IPIF to PCI bridge specification, George Michelogiannakis, University of Crete, 2005,