Large Scale Optimistic Synchronization based simulation of Post Moore Systems (ARO Project)
In this project, we propose to build a post-Moore HPC (High-Performance Computing) system simulation framework that enables large-scale simulation of post-Moore architectures built from emerging devices and technologies. With HPC system performance reaching exaflops and transistor scaling approaching saturation, HPC systems for the post-Moore era are evolving into extremely heterogeneous systems. New computing, memory, interconnect, and storage models are needed to reach the expected performance and energy benefits, and efficient simulation methods and algorithms are needed to simulate these future systems in parallel at large scale.
PDES (Parallel Discrete Event Simulation) is the conventional method for simulating such large-scale systems, using one of three types of synchronization: conservative, optimistic, or hybrid. Recent state-of-the-art simulators such as SST are designed to make CMOS-based HPC system simulation flexible, scalable, and extensible. However, most of these simulators are built on a conservative synchronization model (no speculative execution allowed), which relies on periodic synchronization between neighboring partitions to prevent out-of-order execution of events. This approach works well for homogeneous, synchronous workloads with large time latencies between parallel partitions, because the latency dictates the frequency of synchronization and homogeneity ensures high utilization within each partition between synchronizations. In contrast, we believe that increased heterogeneity, more asynchrony, and higher concurrency will result in system characteristics that play to the strengths of optimistic synchronization models.
Optimistic simulators synchronize in a more fine-grained manner (at the level of logical processes) and only do so when actual synchronizing events occur rather than based on whether a synchronizing event could hypothetically occur (as conservative simulators do).
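The contrast between the two synchronization styles can be illustrated with a deliberately simplified cost model. This is only a sketch under stated assumptions, not how SST or Devastator account for synchronization: the conservative side is modeled as one barrier per lookahead window, and the optimistic side only counts stragglers (events that arrive with a timestamp earlier than the receiver's local clock). The function names `conservative_sync_rounds` and `count_stragglers` are hypothetical.

```python
import math


def conservative_sync_rounds(end_time, lookahead):
    """Barriers needed if partitions synchronize once per lookahead window.

    Assumption: a window-based conservative scheme in which the minimum
    cross-partition latency (the lookahead) fixes the window length.
    """
    if lookahead <= 0:
        raise ValueError("lookahead must be positive")
    return math.ceil(end_time / lookahead)


def count_stragglers(arrival_timestamps):
    """Count events that arrive "in the past" of one logical process.

    An optimistic simulator pays a synchronization cost (a rollback) only
    for these events; all other events are processed without waiting.
    """
    local_clock = float("-inf")
    stragglers = 0
    for ts in arrival_timestamps:
        if ts < local_clock:
            stragglers += 1      # would trigger a rollback in Time Warp
        else:
            local_clock = ts     # normal forward progress
    return stragglers


if __name__ == "__main__":
    # Small lookahead relative to the simulated time means many barriers,
    # whether or not any cross-partition event actually occurs.
    print(conservative_sync_rounds(1000, 10))      # 100
    # Only the out-of-order arrivals (3 after 5, 6 after 7) cost anything.
    print(count_stragglers([1, 5, 3, 7, 6, 9]))    # 2
```

In this toy model the conservative cost depends only on the lookahead, while the optimistic cost depends on how often events really arrive out of order, which is the distinction drawn in the paragraph above.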
This work investigates the performance of optimistic synchronization in the domain of highly heterogeneous and asynchronous computer architectural simulations. We propose to extend our PARADISE tool flow, a chip/node-level architectural simulator for post-Moore architectures, to support large-scale HPC system simulation using Devastator, a PDES simulation framework based on optimistic synchronization. PARADISE is an open-source, comprehensive methodology for evaluating emerging technologies, with a vertical simulation flow from the individual device level all the way up to the architectural level. PARADISE can be extended to incorporate any new technology for which a compact model exists.
Devastator is a parallel runtime that implements the Time Warp optimistic parallel discrete event protocol. The runtime design was motivated by the dual desires for high performance on HPC systems and for increased functionality and programmability compared to existing PDES APIs. Devastator is built on top of GASNet-EX and heavily utilizes its lightweight active messages (e.g. fire-and-forget remote procedure calls), which are a first-order primitive that is highly tuned for the networking stacks predominant in HPC. Additionally, for passing messages between threads in the same process, Devastator utilizes a custom thread-to-thread active message layer that is lock-free and atomic-free, yielding a strong performance boost on x86 processors. On top of this global thread-to-thread messaging paradigm sits an optimistic PDES engine that implements the Time Warp protocol. Each CPU core manages a set of owned logical processes (LPs). Each event class provides a forward execution mechanism and a backward rollback mechanism to enable optimistic execution.
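The forward/rollback pairing described above can be sketched in a few lines. This is a minimal single-LP illustration of the Time Warp idea, not Devastator's API: the class names `LogEvent` and `LogicalProcess` and the methods `execute`, `unexecute`, and `receive` are hypothetical, and the sketch omits anti-messages, inter-LP communication, and GVT.

```python
from dataclasses import dataclass


@dataclass
class LogEvent:
    """Hypothetical event class with a forward and a reverse handler."""
    timestamp: float
    label: str

    def execute(self, state):
        # Forward execution: apply the event's effect to the LP state.
        state["log"].append(self.label)

    def unexecute(self, state):
        # Rollback: undo exactly what execute() did.
        state["log"].pop()


class LogicalProcess:
    def __init__(self):
        self.state = {"log": []}
        self.processed = []   # executed events, kept in timestamp order
        self.rollbacks = 0

    def receive(self, event):
        """Execute optimistically; roll back if the event is a straggler."""
        undone = []
        while self.processed and self.processed[-1].timestamp > event.timestamp:
            late = self.processed.pop()
            late.unexecute(self.state)   # reverse in last-in, first-out order
            undone.append(late)
        if undone:
            self.rollbacks += 1
        # Run the new event, then re-execute the undone events in
        # ascending timestamp order.
        for ev in [event] + undone[::-1]:
            ev.execute(self.state)
            self.processed.append(ev)


if __name__ == "__main__":
    lp = LogicalProcess()
    for ev in (LogEvent(1.0, "a"), LogEvent(3.0, "c"), LogEvent(2.0, "b")):
        lp.receive(ev)
    print(lp.state["log"], lp.rollbacks)   # ['a', 'b', 'c'] 1
```

The event at time 2.0 arrives after the event at time 3.0 has already been executed, so the LP undoes the later event, applies the straggler, and re-executes, leaving the state as if the events had been processed in timestamp order.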
Our method for computing global virtual time (GVT) is completely asynchronous and concurrent with the processing of events. We have demonstrated a proof-of-concept scalable transportation simulator that implements PDES on a high-performance computing platform. In that work, we simulated 22 million vehicle trips over a road network with 1.1 million nodes and 2.2 million links, processing 3.9 billion events in less than 10 seconds using 16,384 cores on NERSC’s Cori computer.
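For readers unfamiliar with GVT, the quantity itself can be stated compactly: it is a lower bound on the timestamp of any event that could still cause a rollback. The sketch below computes it from a consistent snapshot, which is the textbook definition and is not the asynchronous algorithm used in Devastator; the function names `global_virtual_time` and `fossil_collect` are hypothetical.

```python
def global_virtual_time(local_clocks, in_flight_timestamps):
    """Minimum over all LP clocks and all messages not yet delivered.

    Assumption: the inputs form a consistent snapshot. Obtaining such a
    snapshot without stopping event processing is the hard part that an
    asynchronous GVT algorithm has to solve.
    """
    return min(list(local_clocks) + list(in_flight_timestamps))


def fossil_collect(processed_timestamps, gvt):
    """Keep only the history that a future rollback could still need.

    Events with a timestamp below GVT are committed and their saved
    history can be reclaimed.
    """
    return [ts for ts in processed_timestamps if ts >= gvt]


if __name__ == "__main__":
    gvt = global_virtual_time([12.0, 9.5, 15.0], [8.0])
    print(gvt)                                         # 8.0
    print(fossil_collect([3.0, 7.5, 8.0, 11.0], gvt))  # [8.0, 11.0]
```

The in-flight message with timestamp 8.0 holds GVT below every LP clock, which is why undelivered messages have to be included in the minimum.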
This work enhances the post-Moore models built in PARADISE so that they can be used with the Devastator runtime, enabling large-scale (thousands of cores) simulation of post-Moore computing architectures at the HPC system level. We will also investigate and tune the performance of Devastator’s optimistic synchronization in the context of architectural simulation.
PARADISE++ is based on the work done with PARADISE.