# An Instruction Roofline Model for GPUs

# Nan Ding, Samuel Williams

Computational Research Division, Lawrence Berkeley National Lab <u>{nanding, swwilliams}@lbl.gov</u> Nov. 18<sup>th</sup>, 2019





# **History of Roofline Models**

• Sustainable performance is bound by

 $\text{GFLOP/s} = \min \begin{cases} \text{Peak GFLOP/s} \\ \text{AI} \times \text{GB/s} \end{cases}$ 

### Roofline Model

• Arithmetic Intensity (AI) : FLOPs/Byte



### Hierarchical Roofline Model

- Arithmetic Intensity (AI) : FLOPs/Byte(L1/L2/DRAM)
- Additional compute ceilings: No-FMA peak




### **Roofline is Useful**



**Driving performance optimization** 

### However...

- Even with sufficient data locality, one cannot guarantee high performance
  - Pathological memory access patterns?
  - Re-design the data layout?
  - Limited by instruction throughput?
- Many applications perform more integer operations than floating-point operations, or no FLOPs at all



Figure: roofline plot of **Peak GFLOP/s** <u>???</u> versus Arithmetic Intensity (FLOPs:Byte).

### Motivation for the Instruction Roofline Model



### **Practical Use**

### What is holding you back?

What optimizations should

### When to stop optimization?

### **Drive Code Optimization** in a Good Visual Manner

## The First Step to Instruction Roofline Model

• Sustainable performance is bound by

 $\text{GFLOP/s} = \min \begin{cases} \text{Peak GFLOP/s} \\ \text{AI} \times \text{GB/s} \end{cases}$ 

 $\text{GIPS} = \min \begin{cases} \text{Peak GIPS} \\ \text{II} \times \text{GB/s} \end{cases}$ 

Expanding the applicability of roofline to several emerging computational domains

Forms the basis for several subsequent **Instruction Roofline-oriented performance analysis** technologies

- Identify fetch-decode-issue bottlenecks
- Functional unit utilization (FPU, tensor, integer, etc.)



Figure x-axis: Arithmetic Intensity (FLOPs:Byte)

## The Second Step to Instruction Roofline Model

- Sustainable performance is bound by

 $\text{GFLOP/s} = \min \begin{cases} \text{Peak GFLOP/s} \\ \text{AI} \times \text{GB/s} \end{cases}$ 

 $\text{GIPS} = \min \begin{cases} \text{Peak GIPS} \\ \text{II} \times \text{GB/s} \end{cases}$ 

- **Instruction Intensity** (II): Instructions per Byte

Limitation:

Hard to motivate further performance analysis techniques, such as memory access pattern analysis

Expanding the applicability of roofline to several emerging computational domains





# A Final Step to Instruction Roofline Model on GPUs

- Instruction Intensity
  - Instructions per Transaction

Expanding the applicability of roofline to more performance analysis technologies on GPUs

Forms the basis for several subsequent **Instruction Roofline-oriented performance analysis** technologies on GPUs:

- Memory access patterns
- **Memory Transaction** 
  - the natural unit to access data on NVIDIA GPUs
  - the natural unit to **analyze memory access**
  - a warp-level load/store -> 1 to 32 transactions

 $\text{GIPS} = \min \begin{cases} \text{Peak GIPS} \\ \text{Instructions/Transaction} \times \text{GTXN/s} \end{cases}$ 





• Sustainable performance is bound by

 $\text{GIPS} = \min \begin{cases} \text{Peak GIPS} \\ \text{Instruction Intensity} \times \text{GTXN/s} \end{cases}$ 

- Theoretical peak on V100: 80 SMs × 4 warp schedulers × 1 inst/cycle × 1.53 GHz = 489.6 GIPS
- Memory ceilings on V100:
  - Based on the GB/s from Empirical Roofline Toolkit<sup>[1]</sup>
  - Calculate the number of equivalent 32-byte transactions
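
The two kinds of ceilings above amount to a few lines of arithmetic. The sketch below recomputes the theoretical peak from the V100 figures on this slide and converts a bandwidth into billions of 32-byte transactions per second. The 800 GB/s value is a placeholder chosen for illustration, not a measured Empirical Roofline Toolkit result, and the function names are hypothetical.

```python
TRANSACTION_BYTES = 32  # transaction size used for the memory ceilings

def peak_warp_gips(sms, schedulers_per_sm, inst_per_cycle, clock_ghz):
    """Theoretical peak in billions of warp instructions per second."""
    return sms * schedulers_per_sm * inst_per_cycle * clock_ghz

def gtxn_per_s(bandwidth_gb_s):
    """Convert GB/s into billions of 32-byte transactions per second."""
    return bandwidth_gb_s / TRANSACTION_BYTES

def attainable_gips(instruction_intensity, peak_gips, gtxn_s):
    """Roofline bound: min(Peak GIPS, Instruction Intensity * GTXN/s)."""
    return min(peak_gips, instruction_intensity * gtxn_s)

peak = peak_warp_gips(80, 4, 1, 1.53)   # V100 figures from the slide
ceiling = gtxn_per_s(800.0)             # placeholder bandwidth, NOT a measurement
print(f"peak = {peak:.1f} GIPS, memory ceiling = {ceiling:.1f} GTXN/s")
print(attainable_gips(0.5, peak, ceiling))    # low intensity: bound by the memory ceiling
print(attainable_gips(100.0, peak, ceiling))  # high intensity: bound by peak GIPS
```

With a measured bandwidth for each memory level, the same conversion gives one memory ceiling per level.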



# Capabilities of Instruction Roofline Performance Model (1/2)

# Instruction Throughput





# **Capabilities of Instruction Roofline Model** -- Instruction Throughput

### **Instruction throughput**

All instructions, transactions at each memory level (L1/L2/HBM), and runtime



1. The distance between the dots and the ceilings tells whether a kernel is memory-bound or instruction-bound.
2. The distance between the dots for different memory levels indicates the data reuse.
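
How a dot is placed on the plot from instruction counts, transaction counts, and runtime can be sketched as follows. The counts and runtime are made-up illustrative numbers, not measurements; the function name is hypothetical, and instructions are assumed to be counted at warp level.

```python
def roofline_dot(warp_instructions, transactions, runtime_s):
    """Return (instruction intensity, warp GIPS) for one memory level."""
    intensity = warp_instructions / transactions  # warp instructions per transaction
    gips = warp_instructions / runtime_s / 1e9    # billions of warp instructions per second
    return intensity, gips

# Illustrative counts only (not measured data).
warp_insts = 2_000_000
runtime = 1e-3  # seconds
txns = {"L1": 4_000_000, "L2": 1_000_000, "HBM": 500_000}

dots = {level: roofline_dot(warp_insts, n, runtime) for level, n in txns.items()}
for level, (x, y) in dots.items():
    print(f"{level}: intensity = {x:.2f} inst/txn, performance = {y:.1f} GIPS")
```

All three dots share the same height because they describe the same kernel; only the horizontal position changes with the number of transactions at each level.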

# Capabilities of Instruction Roofline Performance Model (2/2)

# Memory Access Patterns





Memory Access Pattern is Critical to Application Execution Time

### Easy to code in an inefficient memory pattern

Low performance

### Hidden deep in the code

Time-consuming to reason about the performance





## **Capabilities of Instruction Roofline Model** -- Global Memory Patterns



## **Capabilities of Instruction Roofline Performance Model** ---- Three Intensity "Walls" for Strided Global Memory Access Patterns
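
How a stride translates into transactions per warp-level load/store can be sketched as below. This is a simplified model rather than the profiler's actual counting: it assumes 32-byte transactions, a 32-thread warp, naturally aligned elements (4 bytes by default), and an aligned base address. `global_transactions` is an illustrative name.

```python
SECTOR_BYTES = 32  # assumed transaction size
WARP_SIZE = 32

def global_transactions(stride_elems, elem_bytes=4, base_addr=0):
    """Count the distinct 32-byte sectors touched by one warp-level load/store."""
    sectors = {
        (base_addr + lane * stride_elems * elem_bytes) // SECTOR_BYTES
        for lane in range(WARP_SIZE)
    }
    return len(sectors)

for stride in (0, 1, 2, 8):
    txns = global_transactions(stride)
    print(f"stride-{stride}: {txns} transaction(s) per warp-level access, "
          f"intensity = 1/{txns}")
```

Under these assumptions, the fewest transactions (one) occur when every thread touches the same element, and the most (32) occur once the stride in bytes reaches the transaction size.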



# **Capabilities of Instruction Roofline Performance Model** ---- Characterize Global Memory Access Patterns

Break down the L1 dot into global-memory-only metrics, then classify the strided global memory pattern according to the **Global Memory Walls** 





### **Capabilities of Instruction Roofline Performance Model**

---- Shared Memory Access Patterns





### 32 banks per shared memory row

Figure panels: no bank conflict, **32-way** bank conflict, no bank conflict.

### **Capabilities of Instruction Roofline Performance Model** ---- Two Intensity "Walls" for Banked Shared Memory Access Patterns

- "No bank conflict" = $\frac{1 \text{ warp Shared LDST}}{1 \text{ Shared Transaction}}$
  - different 4-byte words, different banks
  - same 4-byte word, same bank
- "32-way bank conflict" = $\frac{1 \text{ warp Shared LDST}}{32 \text{ Shared Transactions}}$
  - different 4-byte words, same bank
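
The two walls can be checked with a small model of the banks. The sketch below assumes 32 banks that are each 4 bytes wide, and takes the number of transactions for a warp-level access to be the largest number of distinct 4-byte words that land in a single bank. `shared_transactions` is a hypothetical helper, not a profiler metric.

```python
from collections import defaultdict

NUM_BANKS = 32  # 32 banks, each 4 bytes wide
WARP_SIZE = 32

def shared_transactions(word_indices):
    """Worst-case number of distinct 4-byte words mapped to one bank."""
    words_per_bank = defaultdict(set)
    for w in word_indices:
        words_per_bank[w % NUM_BANKS].add(w)
    return max(len(words) for words in words_per_bank.values())

unit_stride = list(range(WARP_SIZE))                          # different words, different banks
broadcast = [7] * WARP_SIZE                                   # same word, same bank
stride_32 = [lane * NUM_BANKS for lane in range(WARP_SIZE)]   # different words, same bank

print(shared_transactions(unit_stride))  # 1
print(shared_transactions(broadcast))    # 1
print(shared_transactions(stride_32))    # 32
```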

# **Capabilities of Instruction Roofline Performance Model** ---- Characterize Shared Memory Access Patterns

Break down the L1 dot into shared-memory-only metrics, then classify the banked shared memory pattern according to the Shared Memory Walls





# An example to understand the outputs from Instruction Roofline Model

|                 | Example: Matrix Transpose                                |                                |                                                            |
|-----------------|----------------------------------------------------------|--------------------------------|------------------------------------------------------------|
| Description     | A -> A<sup>T</sup>, stored in column major               |                                |                                                            |
| Matrix size     | 1024 × 1024                                              |                                |                                                            |
| Machine         | NVIDIA's latest V100 GPU                                 |                                |                                                            |
| Implementations | Naive                                                    | Coalesced                      | Coalesced_NoBankConflict                                   |
|                 | Simple copy                                              | Coalesced global memory access | Based on "Coalesced"; reduce shared memory bank conflicts  |
|                 | Using 32×8 thread blocks operating on 32×32 matrix tiles |                                |                                                            |
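
The slides do not show how the third implementation removes the bank conflicts. One common approach, assumed here purely for illustration, is to pad each row of the 32×32 shared-memory tile by one word so that a column access no longer maps to a single bank. The sketch below models that for 4-byte elements; `column_access_conflict` and `row_pitch` are hypothetical names.

```python
NUM_BANKS = 32
TILE_DIM = 32

def column_access_conflict(row_pitch, column=0):
    """Degree of bank conflict when 32 lanes read one column of a tile.

    Lane i reads the 4-byte word at index i * row_pitch + column.
    """
    banks = [(lane * row_pitch + column) % NUM_BANKS for lane in range(TILE_DIM)]
    return max(banks.count(b) for b in set(banks))

print(column_access_conflict(row_pitch=32))  # unpadded tile
print(column_access_conflict(row_pitch=33))  # tile padded by one word per row
```

In this model the unpadded tile sits on the 32-way bank conflict wall, while the padded tile reaches the no-conflict wall.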



**Output Matrix:** A<sup>T</sup> (column major)

Instruction Intensity (Warp Instructions per Transaction)





### **Coalesced Implementation**







### Summary

- Instruction Throughput
  - Expanding the applicability of roofline to several emerging computational domains
- Global Memory Access Patterns
  - Quantify the memory access pattern, e.g. unit-stride vs. gather/scatter
- Shared Memory Access Patterns
  - Denote the efficiency of shared memory access.

There's more in the paper!

- Thread Predication
- More Examples
  - HPGMG (mixed precision): three implementations.
  - BatchSW (integer-only): two implementations



# **Closing Thoughts**





### What the Instruction Roofline Models Tell us...



### **Practical Use**

### Unified visualization of bandwidth and access

**Rapidly tell how** different aspects of architectures constrain

### **Future Work**

- Apply our methodology to other accelerated architectures
- Extend the access efficiency concept to networking, I/O, and Lustre file systems.

## Acknowledgement

- This material is based upon work supported by the Advanced Scientific Computing Research Program in the U.S. Department of Energy, Office of Science, under Award Number DE-AC02-05CH11231.
- This research used resources of the National Energy Research Scientific Computing Center (NERSC) which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231 and the Oak Ridge Leadership Facility which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.
- We thank NVIDIA Corporation for their willingness to answer our myriad of questions on nvprof metrics.

# Questions




