Patricia received her Ph.D. (2019) and M.Sc. (2015) from the University of Virginia, Charlottesville, VA, USA, and her bachelor's degree from the Pontifical Xavierian University (2008), Bogota, Colombia. She also worked as an ASIC Design and Verification Engineer at Hewlett-Packard, Costa Rica.
Her research interests include ultra-low-power VLSI digital and mixed-signal design. Her work focuses on non-conventional computing paradigms that reduce power and energy consumption, such as synchronous and asynchronous stochastic computing, computing with sigma-delta streams, and race logic.
Patricia Gonzalez-Guerrero, Tommy Tracy II, Xinfei Guo, Rahul Sreekumar, Marzieh Lenjani, Kevin Skadron, Mircea R Stan, "Towards on-node Machine Learning for Ultra-low-power Sensors Using Asynchronous ΣΔ Streams", Journal on Emerging Technologies in Computing Systems (JETC), August 26, 2020, doi: https://doi.org/10.1145/3404975
We propose a novel architecture to enable low-power, complex on-node data processing for the next generation of sensors for the internet of things (IoT), smartdust, or edge intelligence. Our architecture combines near-analog-memory-computing (NAM) and asynchronous-computing-with-streams (ACS), eliminating the need for ADCs. ACS enables the ultra-low-power, massive computational resources required to execute complex on-node Machine Learning (ML) algorithms, while NAM addresses the memory wall that represents a common bottleneck for ML and other complex functions. In ACS an analog value is mapped to an asynchronous stream that can take one of two logic levels (vh, vl). This stream-based data representation enables area/power-efficient computing units, such as a multiplier implemented as an AND gate, yielding savings in power of ∼90% compared to digital approaches. Generating the streams for NAM and ACS in a brute-force manner, using analog-to-digital-converters (ADCs) and digital-to-streams-converters, would make the power-latency-energy cost skyrocket, making the approach impractical. Our NAM-ACS architecture eliminates expensive conversions, enabling end-to-end processing on an asynchronous-stream data path. We tailor the NAM-ACS architecture for random forest (RaF), an ML algorithm chosen for its ability to classify using a reduced number of features. Simulations show that our NAM-ACS architecture enables 75% savings in power compared with a single ADC, obtaining a classification accuracy of 85% using an RaF-inspired algorithm.
Xinfei Guo, Vaibhav Verma, Patricia Gonzalez-Guerrero, Sergiu Mosanu, Mircea R Stan, "Back to the future: Digital circuit design in the finfet era", Journal of Low Power Electronics, September 1, 2017, doi: https://doi.org/10.1166/jolpe.2017.1489
It has been almost a decade since FinFET devices were introduced to full production; they allowed scaling below 20 nm, thus helping to extend Moore's law by a precious decade with another decade likely in the future when scaling to 5 nm and below. Due to superior electrical parameters and unique structure, these 3-D transistors offer significant performance improvements and power reduction compared to planar CMOS devices. As we are entering into the sub-10 nm era, FinFETs have become dominant in most of the high-end products; as the transition from planar to FinFET technologies is still ongoing, it is important for digital circuit designers to understand the challenges and opportunities brought in by the new technology characteristics. In this paper, we study these aspects from the device to the circuit level, and we make detailed comparisons across multiple technology nodes ranging from conventional bulk to advanced planar technology nodes such as Fully Depleted Silicon-on-Insulator (FDSOI), to FinFETs. In the simulations we used both state-of-art industry-standard models for current nodes, and also predictive models for future nodes. Our study shows that besides the performance and power benefits, FinFET devices show significant reduction of short-channel effects and extremely low leakage, and many of the electrical characteristics are close to ideal as in old long-channel technology nodes; FinFETs seem to have put scaling back on track! However, the combination of the new device structures, double/multi-patterning, many more complex rules, and unique thermal/reliability behaviors are creating new technical challenges. Moving forward, FinFETs still offer a bright future and are an indispensable technology for a wide range of applications from high-end performance-critical computing to energy-constraint mobile applications and smart Internet-of-Things (IoT) devices.
A. Roy, A. Klinefelter, F. B. Yahya, X. Chen, L. P. Gonzalez-Guerrero, C. J. Lukas, D. A. Kamakshi, J. Boley, K. Craig, M. Faisal, S. Oh, N. E. Roberts, Y. Shakhsheer, A. Shrivastava, D. P. Vasudevan, D. D. Wentzloff, B. H. Calhoun, "A 6.45 μW Self-Powered SoC With Integrated Energy-Harvesting Power Management and ULP Asymmetric Radios for Portable Biomedical Systems", IEEE Transactions on Biomedical Circuits and Systems, December 28, 2015, 9:862 - 874, doi: 10.1109/TBCAS.2015.2498643
This paper presents a batteryless system-on-chip (SoC) that operates off energy harvested from indoor solar cells and/or thermoelectric generators (TEGs) on the body. Fabricated in a commercial 0.13 μm process, this SoC sensing platform consists of an integrated energy harvesting and power management unit (EH-PMU) with maximum power point tracking, multiple sensing modalities, programmable core and a low power microcontroller with several hardware accelerators to enable energy-efficient digital signal processing, ultra-low-power (ULP) asymmetric radios for wireless transmission, and a 100 nW wake-up radio. The EH-PMU achieves a peak end-to-end efficiency of 75% delivering power to a 100 μA load. In an example motion detection application, the SoC reads data from an accelerometer through SPI, processes it, and sends it over the radio. The SPI and digital processing consume only 2.27 μW, while the integrated radio consumes 4.18 μW when transmitting at 187.5 kbps for a total of 6.45 μW.
Meriam Gay Bautista, Patricia Gonzalez-Guerrero, Darren Lyles, George Michelogiannakis, "Superconducting Shuttle-flux Shift Buffer for Race Logic", 2021 IEEE International Midwest Symposium on Circuits and Systems (MWSCAS), August 2021
George Michelogiannakis, Darren Lyles, Patricia Gonzalez-Guerrero, Meriam Bautista, Dilip Vasudevan, Anastasiia Butko, "SRNoC: A Statically-Scheduled Circuit-Switched Superconducting Race Logic NoC", May 2021
Marzieh Lenjani, Patricia Gonzalez, Elaheh Sadredini, Shuangchen Li, Yuan Xie, Ameen Akel, Sean Eilert, Mircea R Stan, Kevin Skadron, "Fulcrum: a simplified control and access mechanism toward flexible and practical in-situ accelerators", International Symposium on High Performance Computer Architecture (HPCA), San Diego, CA, USA, IEEE, February 22, 2020, doi: 10.1109/HPCA47549.2020.00052
In-situ approaches process data very close to the memory cells, in the row buffer of each subarray. This minimizes data movement costs and affords parallelism across subarrays. However, current in-situ approaches are limited to only row-wide bitwise (or few-bit) operations applied uniformly across the row buffer. They impose a significant overhead of multiple row activations for emulating 32-bit addition and multiplications using bitwise operations and cannot support operations with data dependencies or based on predicates. Moreover, with current peripheral logic, communication among subarrays is inefficient, and with typical data layouts, bits in a word are not physically adjacent. The key insight of this work is that in-situ, single-word ALUs outperform in-situ, parallel, row-wide, bitwise ALUs by reducing the number of row activations and enabling new operations and optimizations. Our proposed lightweight access and control mechanism, Fulcrum, sequentially feeds data into the single-word ALU and enables operations with data dependencies and operations based on a predicate. For algorithms that require communication among subarrays, we augment the peripheral logic with broadcasting capabilities and a previously-proposed method for low-cost inter-subarray data movement. The sequential processor also enables overlapping of broadcasting and computation, and reuniting bits that are physically adjacent. In order to realize true subarray-level parallelism, we introduce a lightweight column-selection mechanism through shifting one-hot encoded values. This technique enables independent column selection in each subarray. We integrate Fulcrum with Compute Express Link (CXL), a new interconnect standard.
Fulcrum with one memory stack delivers on average (up to) (i) 23.4 (76) times speedup over a server-class GPU, NVIDIA P100, with three stacks of HBM2 memory, (ii) 70 (228) times speedup per memory stack over the GPU, and (iii) 19 (178.9) times speedup per memory stack over an ideal model of the GPU, which only accounts for the overhead of data movement.
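The column-selection idea in the abstract above (shifting a one-hot encoded value, independently per subarray) can be illustrated with a minimal Python sketch. This is not the Fulcrum hardware design: the function names, the 8-column width, and the wrap-around behaviour at the last column are all assumptions made for illustration.

```python
def one_hot(position):
    """Return an integer with only bit `position` set (column `position` selected)."""
    return 1 << position

def shift_select(select, width):
    """Advance the selected column by one; wrap-around is an illustrative assumption."""
    mask = (1 << width) - 1
    return ((select << 1) | (select >> (width - 1))) & mask

def selected_column(select):
    """Decode a one-hot value back to a column index."""
    return select.bit_length() - 1

WIDTH = 8  # hypothetical number of columns per subarray

# Each subarray keeps its own one-hot register, so selections are independent.
subarray_selects = [one_hot(0), one_hot(5)]

# Shift only subarray 0; subarray 1 keeps pointing at its own column.
subarray_selects[0] = shift_select(subarray_selects[0], WIDTH)
columns = [selected_column(s) for s in subarray_selects]

# Shifting from the last column wraps back to column 0 in this sketch.
wrapped = selected_column(shift_select(one_hot(WIDTH - 1), WIDTH))
```

Because a one-hot value needs only a shift to move to the next column, no per-subarray address decoder is required in this model, which is the appeal of the approach as described.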
Marzieh Lenjani, Patricia Gonzalez, Elaheh Sadredini, M Arif Rahman, Mircea R Stan, "An overflow-free quantized memory hierarchy in general-purpose processors", International Symposium on Workload Characterization (IISWC), Orlando, FL, USA, IEEE, November 3, 2019, doi: 10.1109/IISWC47752.2019.9042035
Data movement comprises a significant portion of energy consumption and execution time in modern applications. Accelerator designers exploit quantization to reduce the bitwidth of values and reduce the cost of data movement. However, any value that does not fit in the reduced bitwidth results in an overflow (we refer to these values as outliers). Therefore accelerators use quantization for applications that are tolerant to overflows. We observe that in most applications the rate of outliers is low and values are often within a narrow range, providing the opportunity to exploit quantization in general-purpose processors. However, a software implementation of quantization in general-purpose processors has three problems. First, the programmer has to manually implement conversions and the additional instructions that quantize and dequantize values, imposing programmer effort and performance overhead. Second, to cover outliers, the bitwidth of the quantized values often becomes greater than or equal to that of the original values. Third, the programmer has to use standard bitwidths; otherwise, extracting non-standard bitwidths (i.e., 1-7, 9-15, and 17-31) for representing narrow integers exacerbates the overhead of software-based quantization. The key idea of this paper is to propose hardware support in the memory hierarchy of general-purpose processors for quantization, which represents values by few and flexible numbers of bits and stores outliers in their original format in a separate space, preventing any overflow. We minimize metadata and the overhead of locating quantized values using a software-hardware interaction that transfers quantization parameters and data layout to hardware. As a result, our approach has three advantages over cache compression techniques: (i) less metadata, (ii) higher compression ratio for floating-point values and cache blocks with multiple data types, and (iii) lower overhead for locating the compressed blocks.
It delivers on average 1.40/1.45/1.56× speedup and 24/26/30% energy reduction compared to a baseline that uses full-length variables in a 4/8/16-core system. Our approach also provides 1.23× speedup, in a 4-core system, compared to state-of-the-art cache compression techniques and adds only 0.25% area overhead to the baseline processor.
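The idea of keeping narrow values in a few bits while storing outliers separately in their original format can be sketched in software. The following Python example is only an illustration of that principle, not the paper's hardware mechanism: the base-plus-offset encoding, the function names, and the dictionary used as the "separate space" are assumptions.

```python
def quantize(values, bits, base):
    """Store (value - base) in `bits` unsigned bits; keep outliers unmodified elsewhere.

    The base-plus-offset scheme is an illustrative assumption.
    """
    limit = 1 << bits
    packed = []    # narrow offsets, one slot per value
    outliers = {}  # index -> original value, for values that do not fit
    for index, value in enumerate(values):
        offset = value - base
        if 0 <= offset < limit:
            packed.append(offset)
        else:
            packed.append(0)  # placeholder slot; the real value lives in `outliers`
            outliers[index] = value
    return packed, outliers

def dequantize(packed, outliers, base):
    """Rebuild the original values; outliers override their placeholder slots."""
    return [outliers.get(i, base + offset) for i, offset in enumerate(packed)]

values = [100, 103, 101, 100000, 107, -5]
packed, outliers = quantize(values, bits=3, base=100)
restored = dequantize(packed, outliers, base=100)
```

In this sketch every offset that fits in 3 bits is stored narrowly, while `100000` and `-5` are kept intact in the side table, so the round trip is lossless and no value overflows.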
Patricia Gonzalez-Guerrero, Mircea R. Stan, "Asynchronous Stochastic Computing", 53rd Asilomar Conference on Signals, Systems, and Computers, Pacific Grove, CA, USA, IEEE, November 3, 2019, doi: 10.1109/IEEECONF44664.2019.9049011
Asynchronous Stochastic Computing (ASC) leverages Synchronous Stochastic Computing (SSC) advantages and addresses its drawbacks. In SSC a multiplier is a single AND gate, saving ~90% of power and area compared with a typical 8-bit binary multiplier. The key to SSC power-area efficiency is mapping numbers to streams of 1s and 0s. Despite the power-area efficiency, SSC drawbacks such as long latency, a costly clock distribution network (CDN), and expensive stream generation cause the energy consumption to grow prohibitively large. In this work, we introduce the foundations for ASC using continuous-time Markov chains, and analyze the computing error due to random fluctuations. In ASC data is mapped to asynchronous continuous-time streams, which yields two advantages over the synchronous counterpart: (1) CDN elimination, and (2) better accuracy. We compare ASC with SSC for three applications: (1) multiplication, (2) an image processing algorithm: gamma correction, and (3) a single layer of a fully-connected artificial neural network (ANN) using a FinFET1X technology. Our Matlab, Spice-level simulations and post-place&route (P&R) reports demonstrate that ASC yields savings of 10%-55%, 33%-44%, and 50% in latency, power, and energy respectively. These savings make ASC a good candidate to address the ultra-low-power requirements of machine learning for the IoT.
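To see why a single AND gate acts as a multiplier in synchronous stochastic computing, consider two independent bit streams whose probability of a 1 encodes a value in [0, 1]. The Python sketch below models only this clocked, software view; the stream length, seed, and function names are assumptions, and it does not model the asynchronous continuous-time streams the paper introduces.

```python
import random

def to_stream(p, n, rng):
    """Encode a value p in [0, 1] as n independent bits, each 1 with probability p."""
    return [1 if rng.random() < p else 0 for _ in range(n)]

def stream_value(bits):
    """Decode a stream: the fraction of 1s estimates the encoded value."""
    return sum(bits) / len(bits)

def sc_multiply(a_bits, b_bits):
    """Bitwise AND of two independent streams; P(a AND b) = P(a) * P(b)."""
    return [a & b for a, b in zip(a_bits, b_bits)]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
n = 20000               # longer streams give a more accurate estimate
a = to_stream(0.5, n, rng)
b = to_stream(0.8, n, rng)
product = stream_value(sc_multiply(a, b))  # statistical estimate of 0.5 * 0.8
```

The estimate converges slowly with stream length, which is one way to picture the long-latency drawback of SSC that the abstract mentions.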
Patricia Gonzalez-Guerrero, Tommy Tracy II, Xinfei Guo, Mircea R Stan, "Towards low-power random forest using asynchronous computing with streams", Tenth International Green and Sustainable Computing Conference (IGSC), IEEE Computer Society, October 1, 2019, doi: 10.1109/IGSC48788.2019.8957193
We propose a sensor architecture for the internet of things (IoT), smartdust or edge-intelligence (EI) that combines near-analog-memory (NAM) processing and asynchronous computing with streams (ACS), addressing the need for machine learning (ML) capabilities at low power budgets. In ACS an analog value is mapped to an asynchronous stream that can take one of two values (vh, vl). This stream-based data representation enables area-power efficient computing units, such as a multiplier implemented as an AND gate, yielding savings in power of 90% compared with digital approaches. However, a major bottleneck for computing on streams, vision sensors, and NAM approaches is the cost of analog-to-digital (ADC) and digital-to-stream-to-digital converters. Our NAM-ACS architecture simplifies the sensor and eliminates the need for the expensive conversions. The architecture is tailored for random forest (RaF), an ML algorithm chosen for its ability to classify using a reduced number of features. Our simulations show that using an analog-memory array of 256 × 512, the power consumption of the ACS-core combined with the memory interface is comparable with the consumption of an ADC-based memory interface, obtaining an accuracy of 83%.
Patricia Gonzalez-Guerrero, Stephen G Wilson, Mircea R Stan, "Error-latency Trade-off for Asynchronous Stochastic Computing with ΣΔ Streams for the IoT", International System-on-Chip Conference (SOCC), Singapore, IEEE, September 3, 2019, doi: 10.1109/SOCC46988.2019.1570548453
Asynchronous stochastic computing (ASC) using continuous-time-asynchronous ΣΔ modulators (SC-AΣΔM) has the potential to enable ultra-low-power, on-node machine learning algorithms for the next generation of sensors for the Internet of Things (IoT). Similar to synchronous stochastic computing (SSC), in SC-AΣΔM complex processing units can be implemented with simple gates because numbers are represented with streams. For example, a multiplier is implemented with an XNOR gate, yielding savings in power and area of 90% compared with the typical binary approach. Previous work demonstrated that SC-AΣΔM leverages SSC advantages and addresses its drawbacks, achieving significant savings in energy, power and latency. In this work, we study a theoretical model to determine the fundamental limits of accuracy and computing time for SC-AΣΔM. Since the ΣΔ streams are periodic, the final computing error is non-zero and depends on the period of the input streams. We validate our theoretical model with Spice-level simulations and evaluate the power and energy consumption using a standard FinFET1X technology for two cases: 1) multiplication and 2) gamma correction, an image processing algorithm. Our work determines circuit design guidelines for SC-AΣΔM and shows that multiplication with SC-AΣΔM requires at least 6X less time than SSC. The latency reduction and novel architecture positively impact the overall energy consumption in the IoT node, enabling savings in energy of 79% compared with the binary approach.
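The XNOR multiplier mentioned above works when values in [-1, 1] are mapped to the probability of a 1 via p = (x + 1) / 2, a common bipolar convention in stochastic computing; for independent streams, the XNOR output then encodes the product of the two inputs. The Python sketch below assumes that convention and uses random Bernoulli streams as a stand-in, so it does not reproduce the periodic ΣΔ streams or the error model studied in the paper; all names and parameters are illustrative.

```python
import random

def bipolar_stream(x, n, rng):
    """Encode x in [-1, 1] as n bits with P(bit = 1) = (x + 1) / 2."""
    p = (x + 1) / 2
    return [1 if rng.random() < p else 0 for _ in range(n)]

def bipolar_value(bits):
    """Decode a bipolar stream: value = 2 * P(1) - 1."""
    return 2 * sum(bits) / len(bits) - 1

def xnor_multiply(a_bits, b_bits):
    """Bitwise XNOR: output is 1 exactly when the two input bits agree."""
    return [1 - (a ^ b) for a, b in zip(a_bits, b_bits)]

rng = random.Random(1)  # fixed seed so the sketch is repeatable
n = 20000
a = bipolar_stream(0.6, n, rng)
b = bipolar_stream(-0.5, n, rng)
estimate = bipolar_value(xnor_multiply(a, b))  # statistical estimate of 0.6 * -0.5
```

The algebra behind it: with p_a and p_b the probabilities of a 1, the XNOR output is 1 with probability p_a p_b + (1 - p_a)(1 - p_b), and 2 times that minus 1 equals (2 p_a - 1)(2 p_b - 1).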
Patricia Gonzalez-Guerrero, Mircea R Stan, "Asynchronous Stream Computing for Low Power IoT", International Midwest Symposium on Circuits and Systems (MWSCAS), Dallas, TX, USA, IEEE, August 4, 2019, doi: 10.1109/MWSCAS.2019.8885388
Asynchronous circuits have many advantages over their synchronous counterparts in terms of robustness to parameter variations, wide supply voltage ranges, and potentially low power by not needing a clock, yet their promise has not been translated yet into commercial success due to several issues related to design methodologies and the need for handshake signals. Stochastic computing is another processing paradigm that has shown promises of low power and extremely compact circuits but has yet to become a commercial success mainly because of the need for a fast clock to generate the random streams. The Asynchronous Stream Processing circuits described in this paper combine the best features of asynchronous circuits (lack of clock, robustness) with the best features of stochastic computing (processing on streams) to enable extremely compact and low power IoT sensing nodes that can finally fulfill the promise of smart dust, another concept that was ahead of its time and yet to achieve commercial success.
Patricia Gonzalez-Guerrero, Xinfei Guo, Mircea R Stan, "ASC-FFT: Area-efficient low-latency FFT design based on asynchronous stochastic computing", 10th Latin American Symposium on Circuits & Systems (LASCAS), Armenia, Colombia, IEEE, February 24, 2019, doi: 10.1109/LASCAS.2019.8667599
Asynchronous Stochastic Computing (ASC) is a new paradigm that addresses Synchronous Stochastic Computing (SSC) drawbacks, namely expensive stochastic number generation (SNG) and long latency, by using continuous time streams (CTS). To go beyond the basic operations of addition and multiplication in ASC we need to incorporate a memory element. Although for SSC the natural memory element is a clocked flip-flop, using the same approach with non-synchronized data leads to unacceptably large error. In this paper, we propose to use a capacitor embedded in a feedback loop as the ASC memory element. Based on this idea, we design a low-error asynchronous adder that stores the carry information in the capacitor. Our adder enables the implementation of more complex computation logic. As an example, we implement an asynchronous stochastic Fast Fourier Transform (ASC-FFT) using a FinFET1X technology. The proposed adder requires 76%-24% less hardware cost compared against conventional and SSC adders respectively. Besides, the ASC-FFT shows 3X less latency when compared with SSC-FFT approaches and significant improvements in latency and area over conventional FFT architectures with no degradation of the computation accuracy measured by the FFT Signal to Noise Ratio (SNR).
Patricia Gonzalez-Guerrero, Xinfei Guo, Mircea Stan, "SC-SD: Towards low power stochastic computing using sigma delta streams", International Conference on Rebooting Computing (ICRC), McLean, VA, USA, IEEE, November 7, 2018, doi: 10.1109/ICRC.2018.8638611
Processing data using Stochastic Computing (SC) requires only ~7% of the area and power of the typical binary approach. However, SC has two major drawbacks that eclipse any area and power savings. First, it takes ~99% more time to finish a computation when compared with the binary approach, since data is represented as streams of bits. Second, the Linear Feedback Shift Registers (LFSRs) required to generate the stochastic streams increase the power and area of the overall SC-LFSR system. These drawbacks result in similar or higher area, power, and energy numbers when compared with the binary counterpart. In this work, we address these drawbacks by applying SC directly on Pulse Density Modulated (PDM) streams. Most modern Systems on Chip (SoCs) already include Analog to Digital Converters (ADCs). The core of ΣΔ-ADCs is the ΣΔ Modulator, whose output is a PDM stream. Our approach (SC-SD) simplifies the system hardware in two ways. First, we drop the filter stage at the ADC and, second, we replace the costly Stochastic Number Generators (SNGs) with ΣΔ-Modulators. To further lower the system complexity, we adopt an Asynchronous ΣΔ-Modulator (AΣΔM) architecture. We design and simulate the AΣΔM using an industry-standard 1× FinFET technology with foundry models. (In modern technologies the node number does not refer to any one feature in the process, and foundries use slightly different conventions; we use 1x to denote the 14/16nm FinFET nodes offered by the foundry.) We achieve power savings of 81% in SNG compared to the LFSR approach. To evaluate how these area and power savings scale to more complex applications, we implement Gamma Correction, a popular image processing algorithm. For this application, our simulations show that SC-SD can save 98%-11% in the total system latency and 50%-38% in power consumption when compared with the SC-LFSR approach or the binary counterpart.
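The observation that a ΣΔ modulator already emits a pulse-density-modulated stream can be illustrated with a textbook first-order modulator. The Python sketch below is a clocked, discrete-time model chosen for simplicity; it is not the asynchronous continuous-time AΣΔM designed in the paper, and the threshold, stream length, and function name are assumptions.

```python
def sigma_delta_stream(x, n):
    """First-order discrete-time sigma-delta modulator for an input x in [0, 1].

    Returns n bits whose density of 1s tracks x (a PDM stream).
    """
    integrator = 0.0
    feedback = 0  # previous output bit, fed back and subtracted from the input
    bits = []
    for _ in range(n):
        integrator += x - feedback               # accumulate the running error
        bit = 1 if integrator >= 0.5 else 0      # 1-bit quantizer
        bits.append(bit)
        feedback = bit
    return bits

bits = sigma_delta_stream(0.3, 1000)
density = sum(bits) / len(bits)  # pulse density approximates the input value
```

Unlike an LFSR-based generator, this stream is deterministic: the integrator keeps the accumulated error bounded, so the pulse density approaches the input with an error that shrinks in proportion to the stream length.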
Xinfei Guo, Vaibhav Verma, Patricia Gonzalez-Guerrero, Mircea R Stan, "When “things” get older: exploring circuit aging in IoT applications", International Symposium on Quality Electronic Design (ISQED), Santa Clara, CA, USA, IEEE, March 13, 2018, doi: 10.1109/ISQED.2018.8357304
The Internet of Things (IoT) brings a paradigm where humans and “things” are connected. Reliability of these devices becomes extremely critical. Circuit aging has become a limiting factor in technology scaling and a significant challenge in designing IoT systems for reliability-critical applications. As IoT becomes a general-purpose technology which starts to adapt to the advanced process nodes, it is necessary to understand how and on what level aging affects different categories of IoT applications. Since aging is highly dependent on operating conditions and switching activities, this paper classifies the IoT applications based on the aging-related metrics and studies aging using the foundry-provided FinFET aging models. We show that for many IoT applications, aging will indeed add to the already tight design margin. As the expected chip lifetime in IoT devices becomes much longer and the failure tolerant requirements of these applications become much more strict, we conclude that aging needs to be considered in the full design cycle and the IoT lifetime estimation needs to incorporate aging as an important factor. We also present application-specific solutions to mitigate circuit aging in IoT systems.
Patricia Gonzalez-Guerrero, Mircea Stan, "Ultra-low-power dual-phase latch based digital accelerator for continuous monitoring of wheezing episodes", SOI-3D-Subthreshold Microelectronics Technology Unified Conference (S3S), Burlingame, CA, USA, IEEE, October 16, 2017, doi: 10.1109/S3S.2017.8308752
We designed an ultra-low-power accelerator for the calculation of the Short Time Fourier Transform (STFT) optimized for wheezing detection. The low power consumption of our accelerator relies on optimizations at different stages of the design process. Post-layout simulations show that at the minimum energy point our accelerator consumes 3.3 pJ/cycle at 0.5 V and 163 kHz. We compare the energy consumption of our implementation with its flip-flop version. Simulations show that we can save up to 50% in energy consumption for a latch-based design vs. a flip-flop-based design, making dual-phase latch-based implementations excellent candidates for ultra-low-power devices.
Alicia Klinefelter, Nathan E Roberts, Yousef Shakhsheer, Patricia Gonzalez, Aatmesh Shrivastava, Abhishek Roy, Kyle Craig, Muhammad Faisal, James Boley, Seunghyun Oh, Yanqing Zhang, Divya Akella, David D Wentzloff, Benton H Calhoun, "A 6.45 μW self-powered IoT SoC with integrated energy-harvesting power management and ULP asymmetric radios", International Solid-State Circuits Conference-(ISSCC) Digest of Technical Papers, San Francisco, CA, USA, IEEE, February 22, 2015, doi: 10.1109/ISSCC.2015.7063087
A 1 trillion node internet of things (IoT) will require sensing platforms that support numerous applications using power harvesting to avoid the cost and scalability challenge of battery replacement in such large numbers. Previous SoCs achieve good integration and even energy harvesting, but they limit supported applications, need higher end-to-end harvesting efficiency, and require duty-cycling for RF communication. This paper demonstrates a highly integrated, flexible SoC platform that supports multiple sensing modalities, extracts information from data flexibly across applications, harvests and delivers power efficiently, and communicates wirelessly.