

Electrical Engineering and

### **Computer Sciences**

# The Roofine Model: A Pedagogical Tool for Program Analysis and Optimization Sam Williams, David Patterson

### Motivation, Goals and Audience

Performance and scalability can be extremely non-intuitive on modern architectures

Success of the multicore paradigm should be premised on extending the capabilities of the world's programmers

Must provide a visually intuitive performance model that the bulk of the world's programmers (not just Ph.Ds) can use to optimize code

Provide realistic performance and productivity expectations Focus on kernel performance and efficiency (e.g. Gflop/s)

### **Not intended for:**

•those interested in fine tuning (+5%) those challenged by program correctness

### SPMD Performance Components

### Three performance components for SPMD kernels

### Computation

Typically floating point (single or double precision)

Could also be graphics, crypto, integer or bitwise operations

### **Communication**

Transfer of data from one level of the memory hierarchy to the next Registers, L1, L2, DRAM, PCIe, Network

### Locality

Balance between communication and computation

- Incorporates 3C's model for cache behavior
- Results in a Flop:Byte ratio

### **Arithmetic Intensity**

The flop per DRAM byte ratio is a well known quantity from HPC: Arithmetic Intensity

Some kernels have arithmetic intensity that grows with the problem size (FFT, Matrix-Matrix multiplication, etc...) However, many interesting kernels have arithmetic intensity that remains constant with problem size (Matrix-Vector multiplication, Structured Grids, etc...)



## Building the Roofline Model for Opteron



### Locality (arithmetic intensity)

| Defines flop:DRAM byte r                      | ratio     | <b>S</b><br>128 |  |
|-----------------------------------------------|-----------|-----------------|--|
| (walls)                                       | <b>()</b> | 120             |  |
| Can never do better                           | iflop/s   | 64              |  |
| than the compulsory                           | flo       | 32              |  |
| flop:byte ratio                               | Ċ         | 16              |  |
| Each architecture/kernel                      | tainable  |                 |  |
| combination has a unique                      | na        | 8               |  |
| number of capacity and                        | ttai      | 4               |  |
| conflict misses                               | a         | 2               |  |
| Ideally obtained with<br>performance counters |           | 1               |  |
| This example is an arbitrary                  |           |                 |  |
| kernel                                        | _         |                 |  |
|                                               |           |                 |  |

**Roofline Model** 

communication, and

by the optimizations

implemented:

Integrate computation,

locality into a single figure

In-core optimizations

(horizontal ceilings)

Bandwidth optimizations

(diagonal ceilings)

(vertical walls)

Memory traffic optimizations

able

# 2 8 flop:DRAM byte ratio

peak SP



# PARALLEL COMPUTING LABORATORY

### Three Types of Software Optimization

Performance is limited by ceilings/walls In effect, a software optimization punches through them Performance at the kernel's (new) arithmetic intensity is limited by the remaining ceilings



### **Other Architectural Paradigms ?**

### Does this technique apply to other architectural paradigms? Create single precision (32b) roofline models for:

- AMD Opteron 2356 (Barcelona)
- ■IBM QS20 Cell Blade
- Sun T2+ T5140 (Victoria Falls)
- NVIDIA G80

### In-core Performance:

In-core parallelism is the primary challenge on superscalars Non-FP instructions can sap the potentially limited instruction fetch/issue bandwidth

On NVIDIA, divergent threads consume more fetch bandwidth. Bandwidth:

Typically common (SW prefetch, NUMA, unit stride, ...)

•NVIDIA: if threads aren't all collaboratively loading a contiguous block, performance drops by more than 8x

### Arithmetic intensity:

Each architecture has unique cache/local store behavior (3C's) Each architecture has its own Achilles' Heel(s)

### In-core ILP/DLP





flop:DRAM byte ratio



/<sub>8</sub> <sup>1</sup>/<sub>4</sub> <sup>1</sup>/<sub>2</sub> 1 2 4 8 16

flop:DRAM byte ratio



Reference Villiams, Leonid Oliker, Richard Vuduc, John Shalf, Katherine Yelick, James Demmel, "Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms", Supercomputing (SC), 2007.



| ŵТ  | WC         |
|-----|------------|
| ļ   | ■A         |
| ļ   | ■X         |
| ۰Т  | hre        |
|     | D          |
|     | ■M         |
|     | ■M         |
| ♦L  |            |
|     | R          |
|     | =13<br>■13 |
|     |            |
|     |            |
|     | ∎ar        |
| *A  |            |
|     |            |
|     | imp<br>- C |
|     | ■Ca<br>mis |
|     |            |
|     |            |
|     |            |
|     | $\leq$     |
|     |            |
| +   | Y          |
|     | ma         |
|     |            |
| Pof | ora        |
|     |            |

Reference Samuel Williams, Jonathan Carter, Leonid Oliker, John Shalf, Katherine Yelick, "Lattice Boltzmann Simulation Optimization on Leading Multicore Platforms", International Parallel & Distributed Processing Symposium (IPDPS) (to appear), 2008. **Best Paper, Application Track** 



### Using the Roofline to Analyze SpMV

### Sparse Matrix

- Most entries are 0.0
- Performance advantage in only
- storing/operating on the nonzeros Requires significant meta data
- Evaluate y=Ax ■A is a sparse matrix
  - **•x** & **y** are dense vectors
- Challenges
  - Difficult to exploit ILP
  - Difficult to exploit DLP
  - Irregular memory access to x
  - Difficult to load balance
- Very low arithmetic intensity (<0.166)</p>

### Auto-tuning

- •NUMA, SW prefetch improve efficiency Matrix compression eliminates compulsory misses
- Original performance
- Auto-tuned performance



# Using the Roofline to Analyze LBMHD

### Plasma turbulence simulation o distributions: is a sparse matrix & **y** are dense vectors ree macroscopic quantities: Density Iomentum (cartesian vector) Agnetic Field (cartesian vector) tice update: Read 73 doubles 300 floating point operations Vrite 79 doubles rithmetic intensity $\sim 1.0$ (ideal) to-tuning

IUMA, unrolling, and SIMDization prove efficiency

Cache bypass eliminates compulsory sses

- Original performance
- Auto-tuned performance











<sup>1</sup>/<sub>16</sub> <sup>1</sup>/<sub>8</sub> <sup>1</sup>/<sub>4</sub> <sup>1</sup>/<sub>2</sub> 1 2 4 8

flop:DRAM byte ratio