# **Chiplets for HPC**

George Michelogiannakis, LBNL

Material credit: John Shalf, LBNL

**Open Chiplet Economy** 

**OCP Sponsored Tutorial** 

Open Compute Project® Chiplet Summit, Feb 6<sup>th</sup> 2024, 1:00 pm to 5:00 pm, Santa Clara, California, USA

# HPC's Future if we Don't Change Course





# Specialization is Nature's Way

#### **Powerful General Purpose**





#### Many Different Specialized (Post-Moore Scarcity)






Intel KNL, AMD, Cavium/Marvell, GPUs

Apple, Google, Amazon, AWS

# We Have to Understand The Market





## Domain Specific Compute Driven by Hyperscalers





## **Opportunity for HPC: New Economic Model**

**Open Chiplets Marketplace is forming (ODSA and UCIe)**

- Licensable IP and assembly by 3<sup>rd</sup> parties lower the barrier to entry
- Leverage the economic model being created by hyperscalers
- Leverage this baseline and extend to support HPC
  - Smaller incremental cost for HPC to "play"
  - HPC has become "too small to attack the city"

### 80:20 Rule: Focus open efforts on what uniquely benefits HPC

- Build up a library of reusable accelerators for HPC.
- Interoperability for sustainability: Interoperate with commercial IP where it exists, and focus open efforts on the 20% that doesn't make commercial sense to license





# Architecture Specialization for Science









#### Materials

- Density Functional Theory (DFT)
- Use O(n) algorithm
- Dominated by FFTs
- FPGA or ASIC

#### CryoEM Accelerator

- LBNL detector
- 750 GB / sec
- Custom ASIC near detector

#### Genomics Accelerator

- String matching
- Hashing
- 2-8bit (ACTG)
- FPGA

#### Digital fluid Accelerator

- 3D integration
- Petascale *chip*
- 1024-layers
- General / special HPC solution


### Algorithm-Driven Design of Programmable Hardware Accelerators

Example: LS3DF/Density Functional Theory (DFT)

**What:** Design the hardware acceleration around the target algorithm/application

**Why:** Huge opportunities to improve performance density and efficiency

- FFT hardware accelerator 50x-100x faster than GPU (using SPIRAL generator)

**How:** Target Density Functional Theory

1. Large fraction of the DOE workload
2. Mature code base and algorithm
3. LS3DF formulation minimizes off-chip communication and scales O(N)





## The DFT kernel for each fragment


Communication Avoiding LS3DF Formulation – Scales O(N)
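To make the O(N) claim concrete, here is a toy cost model. It assumes that a conventional DFT solve costs roughly O(N^3) in the number of atoms and that a fragment-based formulation solves fixed-size fragments independently, so total work grows with the fragment count. The constants, the fragment size, and the function names are hypothetical, and the model ignores fragment overlap and the patching step.

```python
import math


def conventional_dft_cost(n_atoms: int, c: float = 1.0) -> float:
    """Toy model: one global solve whose cost grows as N^3 (assumed)."""
    return c * n_atoms ** 3


def fragment_dft_cost(n_atoms: int, atoms_per_fragment: int = 8, c: float = 1.0) -> float:
    """Toy model: fixed-size fragments, each solved independently.

    Each fragment still costs (fragment size)^3, but that is a constant,
    so total cost is linear in the number of fragments, hence in N.
    """
    n_fragments = math.ceil(n_atoms / atoms_per_fragment)
    return n_fragments * c * atoms_per_fragment ** 3


if __name__ == "__main__":
    for n in (64, 128, 256, 512):
        ratio = conventional_dft_cost(n) / fragment_dft_cost(n)
        print(f"N={n:4d}  conventional/fragment cost ratio = {ratio:.0f}")
```

In this model, doubling the atom count doubles the fragment cost but multiplies the conventional cost by eight.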



## Chiplets Make Specialization Accessible for HPC



See the multi-agency chiplets workshop at https://sites.google.com/lbl.gov/chiplets-workshop-2023/home



# More Flexible and Lower Cost

### **PROVEN EXISTING BUSINESS MODELS**









#### AMD Opens Up Chip Design to the Outside for Custom Future

By Agam Shah, HPCwire, June 15, 2022

AMD is getting personal with chips as it sets sail to make products more to the liking of its customers.

The chipmaker detailed a modular chip future in which customers can mix and match non-AMD processors in a custom chip package.

"We are focused on making it easier to implement chips with more flexibility," said Mark Papermaster, chief technology officer at AMD during the analyst day meeting late last week.



AMD will allow customers to implement multiple dies — also called chiplets or compute tiles — in a tight chip package. AMD already uses tiles, but is now welcoming third parties to make accelerators or other chips to be included in 2D or 3D packages alongside its x86 CPUs and GPUs.

## Standardized die-to-die (D2D) Physical Layer Interfaces (ODSA)



# A protocol: UCIe

Uses CXL or PCIe

- I/O attach with PCIe/CXL.io
- Memory use cases: CXL.mem
- Accelerator use cases: CXL.cache

Open Chiplet: Platform on a Package

High-Speed Standardized Chip-to-Chip Interface (UCIe)


20X I/O Performance at 1/20<sup>th</sup> Power at Launch – Gap more prominent with better on-package technologies in future

Figure labels: Sea of Cores (heterogeneous), Customer IP and Customized Chiplets, Memory, Advanced 2D/2.5D/3D Packaging

https://www.nextplatform.com/2022/03/02/industry-behemoths-back-intels-universal-chiplet-interconnect/

https://www.snia.org/sites/default/files/PM-Summit/2022/PMCS22-Park-CXL-and-



## **ODSA: Open Domain Specific Architecture** Creating an Open Chiplet Marketplace



## Photonic MCM for High Escape Bandwidth for Remote Memory





## **Project38: HPC Improvements Through Innovative Architecture** *Cross-agency architectural exploration*

Project 38 (P38) is a set of vendor-agnostic architectural explorations involving DOD, the DOE Office of Science, and NNSA

- Mission:
  - Demonstrate high performance IUSG node, codesigned to accelerate GraphBLAS
  - Demonstrate modular integration of LBNL/ANL IUSG + commercial IP using Open Chiplets
  - Create new capability for the USG to rapidly assemble/prototype server-class chip designs



Affordable heterogeneous co-integration using chiplets

### Accomplishments thus far

- Released integration platform (MoSAIC)
  - Abstract model to RTL to chiplets or FPGAs
- Created end-to-end cost model for chiplet integration
- Chisel FFT, sparse matrix multiply, and TSGR generators
- GraphBLAS Accelerator ISA for RISC-V (GISA)
- AMD collaboration showed the benefit of sparse matrix/tensor acceleration





#### Look for the Project 38 poster!

# Questions?





## **One Challenge is Escape Bandwidth**



- **Good News:** Extend bandwidth density and lower power/bit
- **Bad News:** Limited (~2cm) reach
  - Cannot get outside of the package (<u>but we need to</u>)







- 5X the bandwidth v. GDDR5
- Up to 16GB
- One-third the footprint
- Half the energy per bit
- Managed memory stack for optimal levels of reliability, availability and serviceability




### Chiplet Bandwidth Roadmap (5 generations of BW doubling)

Table 5: Physical IO Scaling Roadmap for 2D and Enhanced-2D Architectures that use both solder and hybrid interconnects.

| Generation Number →                    |                                            | 1   | 2   | 3    | 4    | 5     |
|----------------------------------------|--------------------------------------------|-----|-----|------|------|-------|
| Raw Linear Bandwidth Density (GBps/mm) |                                            | 125 | 250 | 500  | 1000 | 2000  |
| Package Technology                     | Minimum Bump Pitch (µm) <sup>17</sup>      | 55  | 40  | 30   | 20   | 10    |
|                                        | Linear Escape Density (IO/mm)              | 500 | 667 | 1000 | 2000 | 4000  |
|                                        | Areal Escape Density (IO/mm <sup>2</sup> ) | 331 | 625 | 1111 | 2500 | 10000 |
| Signaling Speed (Gbps)                 |                                            | 2   | 3   | 4    | 4    | 4     |
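The rows of Table 5 appear to be related by simple arithmetic: linear escape density (IO/mm) times signaling speed (Gbps) divided by 8 bits per byte reproduces the raw linear bandwidth density (GBps/mm). The sketch below checks this; the relationship is inferred from the tabulated numbers, and the function name is hypothetical.

```python
def linear_bandwidth_density(escape_io_per_mm: float, signaling_gbps: float) -> float:
    """Raw bandwidth density in GBps/mm, assuming one bit per IO per transfer."""
    return escape_io_per_mm * signaling_gbps / 8.0


# Values copied from Table 5, generations 1 through 5.
LINEAR_ESCAPE_IO_PER_MM = [500, 667, 1000, 2000, 4000]
SIGNALING_GBPS = [2, 3, 4, 4, 4]
TABLE_GBPS_PER_MM = [125, 250, 500, 1000, 2000]

if __name__ == "__main__":
    for gen, (io, rate, expected) in enumerate(
        zip(LINEAR_ESCAPE_IO_PER_MM, SIGNALING_GBPS, TABLE_GBPS_PER_MM), start=1
    ):
        computed = linear_bandwidth_density(io, rate)
        print(f"gen {gen}: computed {computed:.1f} GBps/mm, table {expected}")
```

Note that signaling speed stops increasing after generation 3, so the later doublings come entirely from tighter bump pitch.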

#### **5.1.2** Area Interconnects for 3D Architectures (see Figure 1)

Table 6: Physical IO Scaling Roadmap for 3D architectures that use both solder and hybrid interconnects.

| Generation Number →                                               |                                            | 1   | 2    | 3    | 4    | 5     |
|-------------------------------------------------------------------|--------------------------------------------|-----|------|------|------|-------|
| Raw Areal Bandwidth Density (GBps/mm <sup>2</sup> ) <sup>18</sup> |                                            | 125 | 250  | 500  | 1000 | 2000  |
| Package Technology                                                | Minimum Bump Pitch (µm) <sup>19</sup>      | 40  | 30   | 20   | 15   | 10    |
|                                                                   | Areal Escape Density (IO/mm <sup>2</sup> ) | 625 | 1111 | 2500 | 4444 | 10000 |
| Signaling Speed (Gbps) <sup>20</sup>                              |                                            | 1.6 | 1.8  | 1.6  | 1.8  | 1.6   |
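The areal figures in Table 6 are consistent with a square bump grid: one IO per pitch-by-pitch cell gives (1000 / pitch in µm)² IOs per mm², and multiplying by the signaling speed and dividing by 8 gives GBps/mm². The sketch below reproduces the table under that assumption; the square-grid model and the function names are inferred from the numbers, not stated in the roadmap text.

```python
def areal_escape_density(bump_pitch_um: float) -> float:
    """IOs per mm^2 for a square grid of bumps at the given pitch (assumed layout)."""
    bumps_per_mm = 1000.0 / bump_pitch_um
    return bumps_per_mm ** 2


def areal_bandwidth_density(bump_pitch_um: float, signaling_gbps: float) -> float:
    """Raw bandwidth density in GBps/mm^2, assuming one bit per IO per transfer."""
    return areal_escape_density(bump_pitch_um) * signaling_gbps / 8.0


# Values copied from Table 6, generations 1 through 5.
BUMP_PITCH_UM = [40, 30, 20, 15, 10]
SIGNALING_GBPS_3D = [1.6, 1.8, 1.6, 1.8, 1.6]

if __name__ == "__main__":
    for gen, (pitch, rate) in enumerate(zip(BUMP_PITCH_UM, SIGNALING_GBPS_3D), start=1):
        ios = areal_escape_density(pitch)
        bw = areal_bandwidth_density(pitch, rate)
        print(f"gen {gen}: {ios:.0f} IO/mm^2, {bw:.0f} GBps/mm^2")
```

Because density scales with the inverse square of pitch, halving the pitch quadruples the IO count, which is why the 3D roadmap can double bandwidth per generation while signaling speed stays below 2 Gbps.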



## **Package Limited Bandwidth**






### Rapid Prototyping of HPC Data Analytics Engine using Open/Modular Chiplets



### **Package Performance is Pin Limited**



Rent's Rule (IBM, 1960): Number of pins = K x Gates<sup>a</sup>, with K = 0.82 and a = 0.45 for early microprocessors
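A quick numerical sketch of Rent's Rule shows why packages become pin limited: with an exponent below 1, pin count grows far more slowly than gate count. The defaults below are the early-microprocessor constants quoted above; the function name and the sample gate counts are illustrative assumptions.

```python
def rent_pins(gates: float, k: float = 0.82, a: float = 0.45) -> float:
    """Rent's Rule estimate of pin count: pins = K * gates**a."""
    return k * gates ** a


if __name__ == "__main__":
    for gates in (1e4, 1e6, 1e8, 1e10):
        print(f"{gates:.0e} gates -> about {rent_pins(gates):,.0f} pins")
    # Growth factor in pins for a 100x increase in gates.
    print(f"100x gates -> {100 ** 0.45:.2f}x pins")
```

With a = 0.45, a 100x increase in gates yields only about an 8x increase in pins, so on-chip compute outruns off-chip connectivity as integration scales.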



#### Pins x GHz from Rent's Rule

Bandwidth gap: ~500x and growing!

High SerDes rates run counter to the end of Dennard scaling

### Datacenters: Worsen climate change without ultra-energy-efficiency

Data movement dominates that power consumption



- January 2021 SRC report projects datacenter energy growth rates will lead to ~25% consumption of planetary energy by 2040.
- Data movement is a dominant contributor to that power consumption



## <u>What</u> is a Chiplet?





# **Different Options**



