





# How Open Source Hardware Will Drive the Next Generation of HPC Systems

# **George Michelogiannakis**

Research Scientist, Lawrence Berkeley National Laboratory





#### **Diminishing Returns**

Creating smaller circuitry has placed more transistors on chips but triggered higher costs.







# Now – 2025

Moore's Law continues through ~5 nm, beyond which diminishing returns are expected.

End of Moore's Law: 2025-2030?

# Post Moore Scaling

From 2025 onward: new materials and devices introduced to enable continued scaling of electronics performance and efficiency.

[Figure: performance over time, from 2016 through 2016-2025 to 2025+]



# **More Accelerators in HPC**









# **Performance Share**











- How do we design accelerators for a wide variety of applications?



Y. S. Shao et al., "Aladdin"



# **But This Will Further Increase Cost**













Root cause: complexity growth







# Reduce Hardware Development Effort to Explore the Specialization Spectrum with:

**Open-Source Hardware** 

**High-Level Synthesis Languages** 





- Closed-source IP is a major drag on innovation
  - High barrier to entry
  - Open nature enables customization
- Create a community
- Shorten design cycles
  - Share hardware and software stack
- Open-source hardware can form the basis of generators







- Shows there is large community interest
- Does not go far enough
  - Majority are point designs
- 1190 projects
  - 55 labeled "mature"







### The Rise of Open Source Software: Will Hardware Follow Suit?



- Rapid growth in the adoption and number of open source software projects
- More than 95% of web servers run Linux variants, approximately 85% of smartphones run Android variants
- Will open source hardware ignite the semiconductor industry?

GSA 2016











- New DSLs raise the abstraction level
  - Increase productivity and code reuse
- Hardware generators are more efficient
  - Reduce cost, risk, and design time







# **Reuse: Shared Lines of RTL Code (Chisel)**

| RISC-V Core                | Z-scale | Rocket | BOOM |
|----------------------------|---------|--------|------|
| Description                | 32-bit<br>3-stage pipeline<br>in-order<br>1-instruction issue<br>L1 caches<br>(≈ ARM Cortex-M0) | 64-bit, FPU, MMU<br>5-stage pipeline<br>in-order<br>1-instruction issue<br>L1 & L2 caches<br>(≈ ARM Cortex-A5) | 64-bit, FPU, MMU<br>5-stage pipeline<br>out-of-order<br>2-, 3-, or 4-instruction issue<br>L1 & L2 caches<br>(≈ ARM Cortex-A9) |
| Unique LOC                 | 600 (40%) | 1,400 (10%) | 9,000 (45%) |
| LOC all 3 share            | 500 (30%) | 500 (5%) | 500 (5%) |
| LOC Z-scale & Rocket share | 500 (30%) | 500 (5%) | |
| LOC Rocket & BOOM share    | | 10,000 (80%) | 10,000 (50%) |
| Total LOC                  | 1,600 | 12,400 | 19,500 |
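The reuse this table implies can be checked with a few lines of arithmetic. The sketch below is only an illustration: the line counts are copied from the table, while the dictionary layout and the `reuse_fraction` helper are hypothetical names introduced here, not part of Chisel or any RISC-V tooling. The percentages in the table appear to be rounded, so the computed fractions will differ slightly from them.

```python
# Line counts copied from the table above (lines of Chisel RTL).
# The structure and names below are illustrative assumptions.
cores = {
    "Z-scale": {"unique": 600, "shared": [500, 500], "total": 1600},
    "Rocket": {"unique": 1400, "shared": [500, 500, 10000], "total": 12400},
    "BOOM": {"unique": 9000, "shared": [500, 10000], "total": 19500},
}


def reuse_fraction(core):
    """Fraction of a core's lines that are shared with at least one other core."""
    entry = cores[core]
    return sum(entry["shared"]) / entry["total"]


for name, entry in cores.items():
    # Sanity check: unique plus shared lines should equal the stated total.
    assert entry["unique"] + sum(entry["shared"]) == entry["total"]
    print(f"{name}: {reuse_fraction(name):.1%} of lines shared with another core")
```

Under these numbers, more than half of each core's RTL is shared with at least one sibling design, which is the point the slide makes about generator-based reuse.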





## **Use Open-Source Hardware: Specialization Opportunities**



# **A Specialization Opportunity**



- On-detector processing
  - Future detectors have data rates exceeding 1 Tb/s
  - Proposed solution: process data before it leaves the sensor
  - Application-tailored, programmable processing
    - Programmability allows processing to be tailored to the experiment







| 7 Giants of Data (NRC) | 7 Motifs of Simulation |
|------------------------|------------------------|
| Basic statistics       | Monte Carlo methods    |
| Generalized N-body     | Particle methods       |
| Graph theory           | Unstructured meshes    |
| Linear algebra         | Dense linear algebra   |
| Optimizations          | Sparse linear algebra  |
| Integrations           | Spectral methods       |
| Alignment              | Structured meshes      |





# Quantum Computer = Quantum PU + Control Hardware



1000 qubits, gate time 10 ns, 3 ops/qubit: **300 billion ops per second**
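The 300 billion figure follows from simple arithmetic, sketched below. It assumes, as one reading of the slide, that the control hardware must issue 3 operations per qubit within every gate time; the function name and parameters are hypothetical and chosen only for illustration. Integer arithmetic in nanoseconds is used to avoid floating-point rounding.

```python
NS_PER_SECOND = 1_000_000_000


def control_ops_per_second(qubits, gate_time_ns, ops_per_qubit):
    """Control operations per second, assuming every qubit needs
    `ops_per_qubit` operations in each gate time (an assumption)."""
    ops_per_gate_time = qubits * ops_per_qubit
    return ops_per_gate_time * NS_PER_SECOND // gate_time_ns


# Values from the slide: 1000 qubits, 10 ns gate time, 3 ops per qubit.
rate = control_ops_per_second(qubits=1000, gate_time_ns=10, ops_per_qubit=3)
print(f"{rate:,} control ops per second")
```

Because the rate scales linearly with qubit count and inversely with gate time, doubling the qubits or halving the gate time doubles the load on the control hardware.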





# **Some Current Projects**





### A complete set of tools





# **OpenSoC System Architect**









- Strikingly, though coincidentally, similar to the Sunway node architecture
- 4 Z-Scale processors connected on a 4x4 mesh, with Micron HMC memory
- Created by two people in two months









