Anastasiia Butko, Ph.D., is a Research Scientist in the Computational Research Division at Lawrence Berkeley National Laboratory (LBNL), CA. Her research interests lie in the general area of computer architecture, with particular emphasis on high-performance computing, emerging and heterogeneous technologies, and the associated parallel programming and architectural simulation techniques. Broadly, her research addresses the question of how alternative technologies can provide continuing performance scaling in the approaching Post-Moore’s Law era. Her primary research projects include the development of EDA tools for fast superconducting logic design, a classical ISA for quantum processor control, and fast and flexible System-on-Chip generators using the Chisel DSL.
Dr. Butko received her Ph.D. in Microelectronics from the University of Montpellier, France (2015). Her doctoral thesis developed fast and accurate simulation techniques for the exploration of many-core architectures. Her graduate work was conducted within the European project MontBlanc, which aims to design a new supercomputer architecture using low-power embedded technologies.
Dr. Butko received her MSc degree in Microelectronics from UM2, France, and MSc and BSc degrees in Digital Electronics from NTUU "KPI", Ukraine. During her Master's studies she participated in the international double-diploma program between the Montpellier and Kiev universities.
- IARPA SuperTools
- Open2C: Open Cache Coherency
- OpenSoC Fabric: An Open-Source Network-On-Chip Generator
- qFirm: Digital Firmware for Classical Control of Qubits
- Project 38: A set of vendor-agnostic architectural explorations involving NSA, the DOE Office of Science, and NNSA
- QUASAR Ice: A 32-bit RISC-V Core with Quantum Specific Extensions
- VTE: Verilator Testbench Environment
Rafael Garibotti, Anastasiia Butko, Luciano Ost, Abdoulaye Gamatié, Gilles Sassatelli, Chris Adeniyi-Jones, "Efficient Embedded Software Migration towards Clusterized Distributed-Memory Architectures", IEEE Transactions on Computers, August 1, 2016, doi: 10.1109/TC.2015.2485202
A large portion of existing multithreaded embedded software has been programmed for symmetric shared-memory platforms where a monolithic memory block is shared by all cores. Such platforms accommodate popular parallel programming models such as POSIX threads and OpenMP. However, with the growing number of cores in modern manycore embedded architectures, they present a bottleneck related to their centralized memory accesses. This paper proposes a solution tailored for the efficient execution of applications defined with shared-memory programming models on on-chip distributed-memory multicore architectures. It shows how performance, area and energy consumption are significantly improved thanks to the scalability of these architectures. This is illustrated in an open-source realistic design framework, including tools from ASIC to microkernel.
Luciano Ost, Rafael Garibotti, Gilles Sassatelli, Gabriel Marchesan Almeida, Rémi Busseuil, Anastasiia Butko, Michel Robert, Jürgen Becker, "Novel Techniques for Smart Adaptive Multiprocessor SoCs", IEEE Transactions on Computers, March 20, 2013, doi: 10.1109/TC.2013.57
The growing concerns of power efficiency, silicon reliability and performance scalability motivate research in the area of adaptive embedded systems, i.e. systems endowed with decisional capacity, capable of online decision making so as to meet certain performance criteria. The scope of possible adaptation strategies is subject to the targeted architecture specifics, and may range from simple scenario-driven frequency/voltage scaling to rather complex heuristic-driven algorithm selection. This paper advocates the design of distributed memory homogeneous multiprocessor systems as a suitable template for best exploiting adaptation features, thereby tackling the aforementioned challenges. The proposed solution lies in the combined use of a typical application processor for global orchestration along with such an adaptive multiprocessor core for the handling of data-intensive computation. This paper describes an exploratory homogeneous multiprocessor template designed from the ground up for scalability and adaptation. The proposed contributions aim at increasing architecture efficiency through smart distributed control of architectural parameters such as frequency, and enhanced techniques for load balancing such as task migration and dynamic multithreading.
Anastasiia Butko, George Michelogiannakis, David Donofrio, John Shalf, "Extending classical processors to support future large scale quantum accelerators", Proceedings of the 16th ACM International Conference on Computing Frontiers, April 2019
Anastasiia Butko, George Michelogiannakis, David Donofrio, John Shalf, "TIGER: topology-aware task assignment approach using Ising machines", Proceedings of the 16th ACM International Conference on Computing Frontiers, April 2019
Anastasiia Butko, Albert Chen, David Donofrio, Farzad Fatollahi-Fard, John Shalf, "Open2C: Open-source Generator for Exploration of Coherent Cache Memory Subsystems", MEMSYS '18, New York, NY, USA, ACM, 2018, 311--317, doi: 10.1145/3240302.3270314
Anastasiia Butko, Florent Bruguier, Abdoulaye Gamatié, Gilles Sassatelli, "Efficient Programming for Multicore Processor Heterogeneity: OpenMP versus OmpSs", Open Source Supercomputing Workshop, Frankfurt, Germany, Springer’s Lecture Notes in Computer Science (LNCS), June 22, 2017
ARM single-ISA heterogeneous multicore processors combine high-performance big cores with power-efficient small cores. They aim at achieving a suitable balance between performance and energy. However, a main challenge is to program such architectures so as to efficiently exploit their features. In this paper, we study the impact on performance and energy trade-offs of single-ISA architecture according to OpenMP 3.0 and the OmpSs programming models. We consider different symmetric/asymmetric architecture configurations in terms of core frequency and core count between big and LITTLE clusters. Experiments are conducted on both a real Samsung Exynos 5 Octa system-on-chip and the gem5/McPAT simulation frameworks. Results show that OmpSs implementations are more sensitive to loop scheduling parameters than OpenMP 3.0. In most cases, best OmpSs configurations significantly outperform OpenMP ones. While cluster frequency asymmetry provides uninteresting results, asymmetric cluster configuration with single high-performance core and multiple low-power cores provides better performance/energy trade-offs in many cases.
D Vasudevan, A Butko, G Michelogiannakis, D Donofrio, J Shalf, "Towards an Integrated Strategy to Preserve Digital Computing Performance Scaling Using Emerging Technologies", Springer International Publishing, January 1, 2017, 115--123, doi: 10.1007/978-3-319-67630-2_10
With the decline and eventual end of historical rates of lithographic scaling, we arrive at a crossroad where synergistic and holistic decisions are required to preserve Moore's law technology scaling. Numerous emerging technologies aim to extend digital electronics scaling of performance, energy efficiency, and computational power/density, ranging from devices (transistors), memories, 3D integration capabilities, specialized architectures, photonics, and others. The wide range of technology options creates the need for an integrated strategy to understand the impact of these emerging technologies on future large-scale digital systems for diverse application requirements and optimization metrics. In this paper, we argue for a comprehensive methodology that spans the different levels of abstraction -- from materials, to devices, to complex digital systems and applications. Our approach integrates compact models of low-level characteristics of the emerging technologies to inform higher-level simulation models to evaluate their responsiveness to application requirements. The integrated framework can then automate the search for an optimal architecture using available emerging technologies to maximize a targeted optimization metric.
Anastasiia Butko, Florent Bruguier, Abdoulaye Gamatié, Gilles Sassatelli, David Novo, Lionel Torres, Michel Robert, "Full-System Simulation of big.LITTLE Multicore Architecture for Performance and Energy Exploration", Embedded Multicore/Many-core Systems-on-Chip (MCSoC), 2016 IEEE 10th International Symposium on, Lyon, France, IEEE, September 21, 2016, doi: 10.1109/MCSoC.2016.20
Single-ISA heterogeneous multicore processors have gained increasing popularity with the introduction of recent technologies such as ARM big.LITTLE. These processors offer increased energy efficiency through combining low power in-order cores with high performance out-of-order cores. Efficiently exploiting this attractive feature requires careful management so as to meet the demands of targeted applications. In this paper, we explore the design of those architectures based on the ARM big.LITTLE technology by modeling performance and power in gem5 and McPAT frameworks. Our models are validated w.r.t. the Samsung Exynos 5 Octa (5422) chip. We show average errors of 20% in execution time, 13% for power consumption and 24% for energy-to-solution.
Anastasiia Butko, Louisa Bessad, David Novo, Florent Bruguier, Abdoulaye Gamatié, Gilles Sassatelli, Lionel Torres, Michel Robert, "Position Paper: OpenMP scheduling on ARM big.LITTLE architecture", 9th Int’l Workshop on Programmability and Architectures for Heterogeneous Multicores (MULTIPROG), Prague, Czech Republic, January 18, 2016
Single-ISA heterogeneous multicore systems are emerging as a promising direction to achieve a more suitable balance between performance and energy consumption. However, proper utilization of these architectures is essential to obtain the energy benefits. In this paper, we demonstrate the ineffectiveness of popular OpenMP scheduling policies when executing the Rodinia benchmark on the Exynos 5 Octa (5422) SoC, which integrates the ARM big.LITTLE architecture.
Anastasiia Butko, Abdoulaye Gamatié, Gilles Sassatelli, Lionel Torres, Michel Robert, "Design exploration for next generation high-performance manycore on-chip systems: Application to big.LITTLE architectures", VLSI (ISVLSI), 2015 IEEE Computer Society Annual Symposium on, Montpellier, France, IEEE, July 8, 2015, doi: 10.1109/ISVLSI.2015.28
Next generation embedded systems will massively adopt on-chip manycore architectures to provide both performance and energy-efficiency. This trend will definitely establish the convergence of embedded computing and high-performance computing. In such a context, one major design challenge will concern the choice of adequate architecture parameters given system requirements. Moreover, it will affect the way applications can suitably exploit architecture resources for an efficient execution. This paper deals with manycore on-chip system design exploration via simulation. It presents an approach enabling one to study central design parameters in an accurate and cost-effective manner. This approach is illustrated through the design exploration for ARM big.LITTLE heterogeneous multicore technology in the gem5 framework.
Anastasiia Butko, Rafael Garibotti, Luciano Ost, Vianney Lapotre, Abdoulaye Gamatie, Gilles Sassatelli, Chris Adeniyi-Jones, "A trace-driven approach for fast and accurate simulation of manycore architectures", Design Automation Conference (ASP-DAC), 2015 20th Asia and South Pacific, Chiba, Japan, IEEE, January 19, 2015, doi: 10.1109/ASPDAC.2015.7059093
The evolution of manycore systems, forecasted to feature hundreds of cores by the end of the decade, calls for efficient solutions for design space exploration and debugging. Among the relevant existing solutions, the well-known gem5 simulator provides a rich architecture description framework. However, these features come at the price of prohibitive simulation time that limits the scope of possible explorations to configurations made of tens of cores. To address this limitation, this paper proposes a novel trace-driven simulation approach for efficient exploration of manycore architectures.
Sophiane Senni, Lionel Torres, Gilles Sassatelli, Anastasiia Butko, Bruno Mussard, "Exploration of magnetic RAM based memory hierarchy for multicore architecture", VLSI (ISVLSI), 2014 IEEE Computer Society Annual Symposium on, Tampa, FL, USA, IEEE, July 9, 2014, doi: 10.1109/ISVLSI.2014.29
Today's memory systems mainly integrate SRAM, DRAM and FLASH technologies. SRAM and DRAM are generally used for cache and working memory, while FLASH memory is used for non-volatile storage at low speed. But all face manufacturing constraints at the most advanced nodes, which compromises further evolution. Besides, with the increasing size of the memory system, a significant portion of the total system power is spent in memories. Magnetic RAM (MRAM) technology is a very attractive alternative, simultaneously offering reasonable performance and power efficiency, high density and non-volatility. While MRAM is still under intensive investigation to improve the manufacturing process, the state of the art shows that this memory technology can be accessed in less than 5 ns with a read/write dynamic energy not far from that of SRAM. Besides, the non-volatility of MRAM can be used to optimize leakage current thanks to instant on/off policies. This paper demonstrates how the current characteristics of MRAM can be used in the memory hierarchy of chip multiprocessors (CMPs). The goal is to highlight the benefit of using MRAM for cache memory in order to maintain overall application performance while saving static power.
Sophiane Senni, Lionel Torres, Gilles Sassatelli, Anastasiia Butko, Bruno Mussard, "Power efficient thermally assisted switching magnetic memory based memory systems", Reconfigurable and Communication-Centric Systems-on-Chip (ReCoSoC), 2014 9th International Symposium on, Montpellier, France, IEEE, May 26, 2014, doi: 10.1109/ReCoSoC.2014.6861357
With the increasing size of the memory system inside today's chips, memories are becoming a critical part of the design of modern embedded systems. SRAM, DRAM and FLASH, respectively used for cache, working memory and non-volatile storage, are the three main memory technologies of current memory hierarchies. But all face manufacturing constraints at the most advanced nodes, which compromises further evolution. Magnetic RAM (MRAM) technology is a very attractive alternative, simultaneously offering reasonable performance and power efficiency, high density and non-volatility. Among the MRAM technologies, while Toggle MRAM suffers from scalability issues and Spin Transfer Torque MRAM (STT-MRAM) faces data retention failures, Thermally Assisted Switching MRAM (TAS-MRAM) uses a scheme allowing a fully scalable bit cell, low-power writing and excellent data retention. This paper demonstrates how the features of TAS-MRAM can lead to power-efficient memory systems. A case study of a TAS-MRAM-based L2 cache shows significant power savings while keeping reasonable performance.
Anastasiia Butko, Rafael Garibotti, Luciano Ost, Gilles Sassatelli, "Accuracy evaluation of gem5 simulator system", Reconfigurable Communication-centric Systems-on-Chip (ReCoSoC), 2012 7th International Workshop on, York, UK, IEEE, July 9, 2012, doi: 10.1109/ReCoSoC.2012.6322869
Design space exploration (DSE) of complex embedded systems that combine a number of CPUs, dedicated hardware and software is a tedious task for which a broad range of approaches exists, from the use of high-level models to hardware prototyping. Each of these entails different simulation speed/accuracy tradeoffs, and thereby enables exploring a certain subset of the design space in a given time. Some simulation frameworks devoted to CPU-centric systems have been developed over the past decade that feature either near real-time simulation speed or moderate to high speed with quasi-cycle level accuracy, often by means of instruction-set simulators or binary translation techniques. This paper presents an evaluation, in terms of accuracy in modeling real systems, of the GEM5 simulator, which belongs to the first class. Performance figures of a wide range of benchmarks (e.g. in domains such as scientific computing and media applications) are captured and compared to results obtained on real hardware.