

# Methodology for Evaluating the Potential of Disaggregated Memory Systems

<u>Nan Ding</u>, Samuel Williams, Hai Ah Nam, Taylor Groves, Muaaz Gul Awan, LeAnn Lindsey, Christopher Daley, Oguz Selvitopi, Leonid Oliker, Nicholas Wright

Lawrence Berkeley National Laboratory

# Outline

- Need-to-know about Memory disaggregation
- Disaggregated memory system architecture
- Characterize application performance on a disaggregated memory system
- Case Study





# What is memory disaggregation?

Today:

- Compute nodes are the basic unit of today's HPC systems
- Compute and memory resources are tightly coupled in each node
- Users request resources in units of whole nodes

Memory Disaggregation:

- Decouple the compute and memory resources
- Allow for independent allocations of these resources regardless of where a job is placed








## Memory disaggregation addresses memory imbalance and improves memory utilization

Expensive memory is often underutilized:

- At Azure: ~50% of all VMs never touch 50% of their rented memory
- At NERSC: only 15% of the scientific workloads on NERSC's Cori supercomputer use over 75% of the available on-node memory
- At LLNL: 10% of jobs utilize more than 75% of the node memory capacity

Memory disaggregation is practical for the public cloud:

- Meets performance requirements at low hardware cost
- CXL-Based Memory Pooling Systems for Cloud Platforms

What impact will these emerging technologies have on HPC?





#### Memory Disaggregation on HPC: More Memory, Less Cost



Figure legend: DRAM w/ 16 DIMMs; HBM3 w/ 8 stacks, 16-Hi; HBM2 w/ 8 stacks, 8-Hi








#### **Conceptual disaggregated memory system architecture**

- One compute node (C) = one APU + HBM3 (512GB) + one NIC
- One memory node (M) = one DDR5 socket (4TB) + one NIC







## **Available Remote Memory Capacity**




- Not all jobs need remote memory
- HBM3 provides 512GB of local memory per compute node
- Jobs that need more than 512GB require remote memory

![](_page_8_Figure_4.jpeg)

![](_page_8_Picture_5.jpeg)

![](_page_8_Picture_6.jpeg)

#### **Available Remote Memory Bandwidth**

![](_page_9_Figure_1.jpeg)


# Outline

- Need-to-know about Memory disaggregation
- Disaggregated memory system architecture
  - A structured system design model
  - Budgets, workloads
  - Available remote memory capacity/bandwidth
- Characterize application performance on a disaggregated memory system
- Case Study

![](_page_10_Picture_8.jpeg)


## Memory Roofline: bound by local memory or remote memory

- FLOP Roofline: Which takes longer?
  - Data movement
  - Compute

- Memory Roofline: Which takes longer?
  - Local memory (= HBM data movement)
  - Remote memory (= DRAM data movement)
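Both questions above reduce to taking the slower of two candidate times. A minimal sketch, assuming the slower component alone sets the runtime; the function names and the units are hypothetical and are not taken from the slides:

```python
def flop_roofline_time(flops, bytes_moved, peak_flops, mem_bw):
    """FLOP Roofline: runtime is the slower of compute and data movement."""
    compute_time = flops / peak_flops
    data_time = bytes_moved / mem_bw
    return max(compute_time, data_time)


def memory_roofline_time(local_bytes, remote_bytes, local_bw, remote_bw):
    """Memory Roofline: runtime is the slower of local (HBM) and
    remote (DRAM) data movement."""
    local_time = local_bytes / local_bw
    remote_time = remote_bytes / remote_bw
    return max(local_time, remote_time)


def bound_by(local_bytes, remote_bytes, local_bw, remote_bw):
    """Report which memory tier takes longer (ties count as local)."""
    local_time = local_bytes / local_bw
    remote_time = remote_bytes / remote_bw
    return "remote" if remote_time > local_time else "local"
```

The remote tier is the bottleneck exactly when `local_bytes / remote_bytes` is smaller than `local_bw / remote_bw`, which is why a high local-to-remote access ratio keeps a workload off the remote bandwidth limit.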

![](_page_11_Figure_7.jpeg)

![](_page_11_Picture_8.jpeg)

#### Bounded by Remote Memory Bandwidth? No, if High L:R

![](_page_12_Figure_1.jpeg)

![](_page_12_Picture_2.jpeg)

#### Remote Memory Access Pattern Implications for Different System Configurations [fixed C:M, varying workload demands]

![](_page_13_Figure_1.jpeg)

![](_page_13_Picture_2.jpeg)


![](_page_14_Figure_1.jpeg)

Less contention when fewer compute nodes require remote memory:

| Compute nodes requiring remote memory | Remote memory bandwidth | Machine balance |
|---------------------------------------|-------------------------|-----------------|
| 10%                                   | 100 GB/s                | 65.5            |
| 40%                                   | 25 GB/s                 | 262             |
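The two rows can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming C:M = 10K:1K, that the memory nodes' 100 GB/s links are shared evenly among the compute nodes needing remote memory, and that machine balance is local HBM bandwidth divided by per-node remote bandwidth. The 6550 GB/s local bandwidth is an assumption back-derived from 65.5 × 100 GB/s, not a value stated on the slides, and all names are hypothetical:

```python
ASSUMED_HBM_BW_GBS = 6550.0  # assumption: back-derived from 65.5 * 100 GB/s


def remote_bandwidth_per_node(compute_nodes, memory_nodes, remote_percent,
                              link_bw_gbs=100.0):
    """Per-node remote bandwidth when the memory nodes' aggregate link
    bandwidth is split evenly among compute nodes that need remote memory."""
    demanding = compute_nodes * remote_percent / 100
    if demanding == 0:
        return link_bw_gbs
    aggregate = memory_nodes * link_bw_gbs
    # A compute node can never exceed its own link bandwidth.
    return min(link_bw_gbs, aggregate / demanding)


def machine_balance(local_bw_gbs, remote_bw_gbs):
    """Ratio of local to remote memory bandwidth."""
    return local_bw_gbs / remote_bw_gbs


for pct in (10, 40):
    bw = remote_bandwidth_per_node(10_000, 1_000, pct)
    print(pct, bw, machine_balance(ASSUMED_HBM_BW_GBS, bw))
```

Quadrupling the share of nodes that need remote memory quarters the bandwidth each one sees, so the machine balance rises by the same factor.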

![](_page_14_Figure_4.jpeg)

![](_page_14_Picture_5.jpeg)

![](_page_14_Picture_7.jpeg)

#### Remote Memory Access Pattern Implications for Different System Configurations [fixed workload demand, varying C:M]

![](_page_15_Figure_1.jpeg)

![](_page_15_Picture_2.jpeg)


![](_page_16_Figure_1.jpeg)

![](_page_16_Picture_2.jpeg)

#### Characterize workloads in a single figure

![](_page_17_Figure_1.jpeg)

![](_page_17_Picture_2.jpeg)

![](_page_17_Picture_4.jpeg)


![](_page_18_Figure_1.jpeg)


# Outline

- Need-to-know about Memory disaggregation
- Disaggregated memory system architecture
- Characterize application performance on a disaggregated memory system
  - Local : Remote memory access ratio
  - Required memory capacity
  - System configurations
- Case Study

![](_page_19_Picture_8.jpeg)


#### **Experiment setup**

- C:M = 10K:1K (compute nodes : memory nodes)
- 10% of compute nodes require remote memory
- Each of those compute nodes can access 4TB of remote DDR5 memory on average
- Each compute node can reach the peak PCIe6 bandwidth of 100GB/s
- Machine balance L:R = 65.5
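With this setup, a workload can be screened using two numbers: its required memory capacity and its L:R access ratio. A minimal sketch, assuming the 512GB HBM3 capacity from the architecture slide, that local and remote capacity add up, and that a workload is bound by remote memory only when its L:R ratio falls below the machine balance; the function name, category labels, and the 1TB = 1024GB conversion are hypothetical choices for illustration:

```python
LOCAL_CAPACITY_GB = 512          # HBM3 per compute node
REMOTE_CAPACITY_GB = 4 * 1024    # 4TB per compute node; assumes 1TB = 1024GB
MACHINE_BALANCE = 65.5           # L:R from the experiment setup


def classify(required_gb, lr_ratio):
    """Place a workload by memory footprint and local:remote access ratio."""
    if required_gb <= LOCAL_CAPACITY_GB:
        return "fits in local memory"
    if required_gb > LOCAL_CAPACITY_GB + REMOTE_CAPACITY_GB:
        return "exceeds available capacity"
    if lr_ratio >= MACHINE_BALANCE:
        return "needs remote memory, bound by local memory"
    return "needs remote memory, bound by remote memory"
```

For example, a hypothetical 2TB workload with an L:R ratio of 100 would land in the "bound by local memory" category, so the 100GB/s link would not limit it under these assumptions.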

![](_page_20_Picture_6.jpeg)

![](_page_20_Picture_8.jpeg)

# Case Study: AI workloads

- The DeepCAM climate benchmark is based on the 2018 work of Kurth et al., which was awarded the ACM Gordon Bell Prize
- The L:R memory ratio is characterized as (FLOP:sample byte) / (FLOP:HBM byte), from Ibrahim et al.
- Training set size = required memory capacity (the whole training set resides in memory nodes)
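Reading the L:R characterization as a quotient makes the units explicit. A short derivation, assuming every sample byte is fetched from remote memory (the training set lives in memory nodes) and using the hypothetical symbols $B_{\text{HBM}}$ for bytes moved from HBM and $B_{\text{sample}}$ for bytes of training samples read:

$$
\mathrm{L{:}R}
= \frac{\mathrm{FLOP} / B_{\text{sample}}}{\mathrm{FLOP} / B_{\text{HBM}}}
= \frac{B_{\text{HBM}}}{B_{\text{sample}}}
$$

The FLOP counts cancel, leaving local bytes per remote byte, which is the quantity compared against the machine balance.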

![](_page_21_Figure_4.jpeg)

![](_page_21_Picture_5.jpeg)

## **Case Study: Eleven Workloads from Five Computational Scenarios**

- 10 out of 11 workloads can leverage disaggregated memory without affecting performance
- STREAM can be a proxy for giant linear solvers (stencil/sparse) with arithmetic intensity AI = O(1) and without any multiphysics/AMR

![](_page_22_Figure_3.jpeg)

![](_page_22_Picture_4.jpeg)

# Conclusions

- A practical and intuitive approach to assessing how much disaggregation is needed or viable, given technology trends and the impact on diverse workloads
- Low PCIe bandwidth does not destroy the value of memory disaggregation once the Local:Remote memory access ratio and the memory capacity requirement are taken into account
- HPC applications show promise for benefiting from disaggregated memory systems
- Beneficial to a wider audience: HPC architects and scientists

![](_page_23_Picture_5.jpeg)

![](_page_23_Picture_7.jpeg)