



## Performance Tuning of Scientific Codes with the Roofline Model

1:30pm Introduction to Roofline

Using Roofline in NESAP 2:00pm

Using LIKWID for Roofline 2:20pm

Using NVProf for Roofline 2:40pm

Samuel Williams

Jack Deslippe

Charlene Yang

Protonu Basu

3:00pm

break / setup NERSC accounts

3:30pm

3:50pm

4:45pm

Introduction to Intel Advisor Hands-on with Intel Advisor closing remarks / Q&A

Charlene Yang Samuel Williams all





# Introductions

#### **Samuel Williams**

Computational Research Division
Lawrence Berkeley National Lab

#### **Charlene Yang**

NERSC
Lawrence Berkeley National Lab

#### **Jack Deslippe**

NERSC
Lawrence Berkeley National Lab
JRDeslippe@lbl.gov

#### Protonu Basu

Computational Research Division
Lawrence Berkeley National Lab





# Introduction to the Roofline Model

#### Samuel Williams

**Computational Research Division Lawrence Berkeley National Lab** 

<u>SWWilliams@lbl.gov</u>





# Acknowledgements

- This material is based upon work supported by the Advanced Scientific Computing Research Program in the U.S. Department of Energy, Office of Science, under Award Number DE-AC02-05CH11231.
- This material is based upon work supported by the DOE RAPIDS SciDAC Institute.
- This research used resources of the National Energy Research Scientific Computing Center (NERSC), which is supported by the Office of Science of the U.S. Department of Energy under contract DE-AC02-05CH11231.
- Special Thanks to:
  - Zakhar Matveev, Intel Corporation
  - Roman Belenov, Intel Corporation





# Introduction to Performance Modeling

#### Why Use Performance Models or Tools?

- Identify performance bottlenecks
- Motivate software optimizations
- Determine when we're done optimizing
  - Assess performance relative to machine capabilities
  - Motivate need for algorithmic changes
- Predict performance on future machines / architectures
  - Sets realistic expectations on performance for future procurements
  - Used for HW/SW Co-Design to ensure future architectures are well-suited for the computational needs of today's applications.



- Many different components can contribute to kernel run time.
- Some are characteristics of the application, some are characteristics of the machine, and some are both (memory access pattern + caches).

#FP operations Flop/s

Cache data movement Cache GB/s

DRAM data movement DRAM GB/s

PCIe data movement PCIe bandwidth

Depth OMP Overhead

MPI Message Size Network Bandwidth

MPI Send: Wait ratio Network Gap

#MPI Wait's Network Latency



Can't think about all these terms all the time for every application...

```
Computational
  Complexity #FP operations Flop/s
        Cache data movement Cache GB/s
        DRAM data movement DRAM GB/s
          PCIe data movement PCIe bandwidth
                       Depth OMP Overhead
            MPI Message Size Network Bandwidth
           MPI Send: Wait ratio Network Gap
                  #MPI Wait's Network Latency
```



 Because there are so many components, performance models often conceptualize the system as being dominated by one or more of these components.

#FP operations Flop/s
Cache data movement Cache GB/s
DRAM data movement DRAM GB/s
PCIe data movement PCIe bandwidth
Depth OMP Overhead
MPI Message Size Network Bandwidth
MPI Send:Wait ratio Network Gap
#MPI Wait's Network Latency



 Because there are so many components, performance models often conceptualize the system as being dominated by one or more of these components.

#FP operations Flop/s

Cache data movement Cache GB/s

DRAM data movement DRAM GB/s

LogCA PCIe data movement PCIe bandwidth

Depth OMP Overhead

MPI Message Size Network Bandwidth

MPI Send: Wait ratio Network Gap

#MPI Wait's Network Latency



 Because there are so many components, performance models often conceptualize the system as being dominated by one or more of these components.

#FP operations Flop/s

Cache data movement Cache GB/s

DRAM data movement DRAM GB/s

PCIe data movement PCIe bandwidth

Depth OMP Overhead

MPI Message Size Network Bandwidth

MPI Send: Wait ratio Network Gap

#MPI Wait's Network Latency



LogGP

 Because there are so many components, performance models often conceptualize the system as being dominated by one or more of these components.

#FP operations Flop/s
Cache data movement Cache GB/s
DRAM data movement DRAM GB/s
PCIe data movement PCIe band Right model
PCIe band Right model
Depth OMP C
MPI Message Size Network B
MPI Send:Wait ratio
#MPI Wait's Network Late icy







# Roofline Model:

**Arithmetic Intensity and Bandwidth** 

#### Performance Models / Simulators

- Historically, many performance models and simulators tracked latencies to predict performance (i.e. counting cycles)
- The last two decades saw a number of latency-hiding techniques...
  - Out-of-order execution (hardware discovers parallelism to hide latency)
  - HW stream prefetching (hardware speculatively loads data)
  - Massive thread parallelism (independent threads satisfy the latency-bandwidth product)
- Effective latency hiding has resulted in a shift from a latency-limited computing regime to a throughput-limited computing regime



#### **Roofline Model**

- Roofline Model is a throughputoriented performance model...
  - Tracks <u>rates</u> not times
  - Augmented with Little's Law (concurrency = latency\*bandwidth)
  - Independent of ISA and architecture (applies to CPUs, GPUs, Google TPUs<sup>1</sup>, etc...)



https://crd.lbl.gov/departments/computer-science/PAR/research/roofline



- One could hope to always attain peak performance (Flop/s)
- However, finite locality (reuse) and bandwidth limit performance.
- Assume:
  - Idealized processor/caches
  - Cold start (data in DRAM)

```
Time = max #FP ops / Peak GFlop/s
#Bytes / Peak GB/s
```





- One could hope to always attain peak performance (Flop/s)
- However, finite locality (reuse) and bandwidth limit performance.
- Assume:
  - Idealized processor/caches
  - Cold start (data in DRAM)







- One could hope to always attain peak performance (Flop/s)
- However, finite locality (reuse) and bandwidth limit performance.
- Assume:
  - Idealized processor/caches
  - Cold start (data in DRAM)







- One could hope to always attain peak performance (Flop/s)
- However, finite locality (reuse) and bandwidth limit performance.
- Assume:
  - Idealized processor/caches
  - Cold start (data in DRAM)

Note, Arithmetic Intensity (AI) = Flops / Bytes (as presented to DRAM)





- Plot Roofline bound using Arithmetic Intensity as the x-axis
- Log-log scale makes it easy to doodle, extrapolate performance along Moore's Law, etc...
- Kernels with Al less than machine balance are ultimately DRAM bound (we'll refine this later...)





### Roofline Example #1

- Typical machine balance is 5-10 flops per byte...
  - 40-80 flops per double to exploit compute capability
  - Artifact of technology and money
  - Unlikely to improve
- Consider STREAM Triad...

```
#pragma omp parallel for
for(i=0;i<N;i++){
    Z[i] = X[i] + alpha*Y[i];
}</pre>
```

- 2 flops per iteration
- Transfer 24 bytes per iteration (read X[i], Y[i], write Z[i])
- Al = 0.083 flops per byte == Memory bound





### Roofline Example #2

- Conversely, 7-point constant coefficient stencil...
  - 7 flops
  - 8 memory references (7 reads, 1 store) per point
  - Cache can filter all but 1 read and 1 write per point
  - Al = 0.44 flops per byte == memory bound,
     but 5x the flop rate





#### **Hierarchical Roofline**

- Real processors have multiple levels of memory
  - Registers
  - L1, L2, L3 cache
  - MCDRAM/HBM (KNL/GPU device memory)
  - DDR (main memory)
  - NVRAM (non-volatile memory)
- Applications can have locality in each level
  - Unique data movements imply unique Al's
  - Moreover, each level will have a unique bandwidth



#### <u>Hierarchical Roofline</u>

- Construct superposition of Rooflines...
  - Measure a bandwidth
  - Measure AI for each level of memory
  - Although an loop nest may have multiple Al's and multiple bounds (flops, L1, L2, ... DRAM)...
  - ... performance is bound by the minimum





#### <u>Hierarchical Roofline</u>

- Construct superposition of Rooflines...
  - Measure a bandwidth
  - Measure AI for each level of memory
  - Although an loop nest may have multiple Al's and multiple bounds (flops, L1, L2, ... DRAM)...
  - ... performance is bound by the minimum





#### **Hierarchical Roofline**

- Construct superposition of Rooflines...
  - Measure a bandwidth
  - Measure AI for each level of memory
  - Although an loop nest may have multiple Al's and multiple bounds (flops, L1, L2, ... DRAM)...
  - ... performance is bound by the minimum





#### <u>Hierarchical Roofline</u>

- Construct superposition of Rooflines...
  - Measure a bandwidth
  - Measure AI for each level of memory
  - Although an loop nest may have multiple Al's and multiple bounds (flops, L1, L2, ... DRAM)...
  - ... performance is bound by the minimum









# Roofline Model:

Modeling In-core Performance Effects

#### Data, Instruction, Thread-Level Parallelism...

Modern CPUs use several techniques to increase per core Flop/s

#### **Fused Multiply Add**

- w = x\*y + z is a common idiom in line; algebra
- Resurgence... se a

  QFMA, etc... se a

  And FMA)
- multiply and add in a single pipeline so that it can complete FMA/cycle

#### **Vector Instructions**

- Many HPC codes apply the same operation to a vector of elements
- Vendors provide vector instructions that apply the same operation to 2, 4, 8, 16 elements...
  x [0:7] \*y [0:7] + z [0:7]
- Vector FPUs complete 8 vector operations/cycle

#### **Deep pipelines**

- The hardware for a FMA is substantial.
- Breaking a single FMA up into several smaller operations and pipelining them allows vendors to increase GHz
- Little's Law applies...
  need FP\_Latency \*
  FP\_bandwidth
  independent instructions



#### Data, Instruction, Thread-Level Parallelism...

- If every instruction were an ADD (instead of FMA), performance would drop by 2x on KNL or 4x on Haswell
- Similarly, if one failed to vectorize, performance would drop by another 8x on KNL and 4x on Haswell
- Lack of threading (or load imbalance) will reduce performance by another 64x on KNL.





#### Superscalar vs. Instruction mix

- Define in-core ceilings based on instruction mix...
- e.g. Haswell
  - 4-issue superscalar
  - Only 2 FP data paths
  - Requires 50% of the instructions to be FP to get peak performance





#### Superscalar vs. Instruction mix

- Define in-core ceilings based on instruction mix...
- e.g. Haswell
  - 4-issue superscalar
  - Only 2 FP data paths
  - Requires 50% of the instructions to be FP to get peak performance
- e.g. KNL
  - 2-issue superscalar
  - 2 FP data paths
  - Requires 100% of the instructions to be FP to get peak performance





#### Superscalar vs. instruction mix

- Define in-core ceilings based on instruction mix...
- e.g. Haswell
  - 4-issue superscalar
  - Only 2 FP data paths
  - Requires 50% of the instructions to be FP to get peak performance
- e.g. KNL
  - 2-issue superscalar
  - 2 FP data paths
  - Requires 100% of the instructions to be FP to get peak performance





#### Superscalar vs. instruction mix

- Define in-core ceilings based on instruction mix...
- e.g. Haswell
  - 4-issue superscalar
  - Only 2 FP data paths
  - Requires 50% of the instructions to be FP to get peak performance
- e.g. KNL
  - 2-issue superscalar
  - 2 FP data paths
  - Requires 100% of the instructions to be FP to get peak performance





#### **Divides and other Slow FP instructions**

- FP Divides (sqrt, rsqrt, ...) might support only limited pipelining
- As such, their throughput is substantially lower than FMA's
- If divides constitute even if 3% of the flop's come from divides, performance can be cut in half.
- Penalty varies substantially between architectures and generations (e.g. IVB, HSW, KNL, ...)









# Roofline Model:

Modeling Cache Effects

 Naively, we can bound Al using only compulsory cache misses

$$AI = \frac{\#Flop's}{Compulsory Misses}$$





- Naively, we can bound Al using only compulsory cache misses
- However, write allocate caches can lower Al





- Naively, we can bound Al using only compulsory cache misses
- However, write allocate caches can lower Al
- Cache capacity misses can have a huge penalty







- Naively, we can bound Al using only compulsory cache misses
- However, write allocate caches can lower Al
- Cache capacity misses can have a huge penalty
- Compute bound became memory bound

 $AI = \frac{\text{#Flop's}}{\text{Compulsory Misses + Write Allocates + Capacity Misses}}$ 







### Roofline Model:

 Broadly speaking, there are three approaches to improving performance:





- Broadly speaking, there are three approaches to improving performance:
- Maximize in-core performance (e.g. get compiler to vectorize)





- Broadly speaking, there are three approaches to improving performance:
- Maximize in-core performance (e.g. get compiler to vectorize)
- Maximize memory bandwidth (e.g. NUMA-aware allocation)





- Broadly speaking, there are three approaches to improving performance:
- Maximize in-core performance (e.g. get compiler to vectorize)
- Maximize memory bandwidth (e.g. NUMA-aware allocation)
- Minimize data movement (increase AI)









# Constructing a Roofline Model requires answering some questions...

#### Questions can overwhelm users...

What is my machine's Properties of the target machine

(Benchmarking)

FMA on my machine?

What is my machine's DDR GB/s?

How much data did my kernel actually move? Properties of an application's execution How many flop (Instrumentation) do? How much did that divide hurt?

Fundamental

properties of the kernel constrained by hardware









### We need tools...

#### **Node Characterization?**

- "Marketing Numbers" can be deceptive...
  - Pin BW vs. real bandwidth
  - TurboMode / Underclock for AVX
  - compiler failings on high-Al loops.
- LBL developed the Empirical Roofline Toolkit (ERT)...
  - Characterize CPU/GPU systems
  - Peak Flop rates
  - Bandwidths for each level of memory
  - MPI+OpenMP/CUDA == multiple GPUs





#### **Instrumentation with Performance Counters?**

- Characterizing applications with performance counters can be problematic...
  - x Flop Counters can be broken/missing in production processors
  - x Vectorization/Masking can complicate counting Flop's
  - x Counting Loads and Stores doesn't capture cache reuse while counting cache misses doesn't account for prefetchers.
  - x DRAM counters (Uncore PMU) might be accurate, but...
    - x are privileged and thus nominally inaccessible in user mode
    - x may need vendor (e.g. Cray) and center (e.g. NERSC) approved OS/kernel changes



#### Forced to Cobble Together Tools...

- Use tools known/observed to work on NERSC's Cori (KNL, HSW)...
  - Used Intel SDE (Pin binary instrumentation + emulation) to create software Flop counters
  - Used Intel VTune performance tool (NERSC/Cray approved) to access uncore counters
- Accurate measurement of Flop's (HSW) and DRAM data movement (HSW and KNL)
- Used by NESAP (NERSC KNL application readiness project) to characterize apps on Cori...



http://www.nersc.gov/users/application-performance/measuring-arithmetic-intensity/



#### Initial Roofline Analysis of NESAP Codes





#### **Evaluation of LIKWID**

- LIKWID provides easy to use wrappers for measuring performance counters...
  - **✓ Works on NERSC production systems**
  - ✓ Minimal overhead (<1%)</p>
  - ✓ Scalable in distributed memory (MPI-friendly)
  - ✓ Fast, high-level characterization
  - x No detailed timing breakdown or optimization advice
  - x Limited by quality of hardware performance counter implementation (garbage in/garbage out)
- Useful tool that complements other tools





#### **Intel Advisor**

#### Includes Roofline Automation...

- Automatically instruments applications (one dot per loop nest/function)
- ✓ Computes FLOPS and AI for each function (CARM)
- ✓ AVX-512 support that incorporates masks
- ✓ Integrated Cache Simulator¹ (hierarchical roofline / multiple Al's)
- Automatically benchmarks target system (calculates ceilings)
- ✓ Full integration with existing Advisor capabilities



http://www.nersc.gov/users/training/events/roofline-training-1182017-1192017



#### Tools and Platforms for Roofline Modeling

scalable app-level

|           | ins                                     | trumentation |          |                                                            |              |
|-----------|-----------------------------------------|--------------|----------|------------------------------------------------------------|--------------|
|           |                                         |              | Intel    | Intel                                                      | NVIDIA       |
|           | Metric                                  | STREAM       | SDE      | Advisor                                                    | NVProf       |
| Benchmark | Peak MFlops  Per  Use ERT to  benchmark |              | ×        | Use Advisor for loop-level instrumentation and analysis on |              |
| 3en       |                                         |              | Ç        |                                                            |              |
| Execution | systems                                 |              |          |                                                            |              |
|           |                                         |              | <b>V</b> |                                                            |              |
|           | %5m.                                    | X            |          | lı lı                                                      | ntel targets |
|           | MIPS                                    | X            |          | X                                                          |              |
|           | DRAM BW                                 | ×            | ×        | $\checkmark$                                               | $\checkmark$ |
|           | Cache BW                                | X            | ×        |                                                            | $\checkmark$ |
|           | Auto-Roofline                           | X            | ×        | $\checkmark$                                               | X            |
| Platforms | Intel CPUs                              | <b>√</b>     | <b>√</b> | <b>√</b>                                                   | X            |
|           | IBM Power8                              | $\checkmark$ | ×        | ×                                                          | X            |
|           | <b>NVIDIA GPUs</b>                      | $\checkmark$ | ×        | ×                                                          | $\checkmark$ |
|           | AMD CPUs                                | $\checkmark$ | ?        | ?                                                          | X            |
|           | AMD GPUs                                | <b>/</b>     | X        | ×                                                          | ×            |
|           | ARM                                     | <b>/</b>     | X        | X                                                          | ×            |







### Questions?





### Backup





### Complexity, Depth, ...

#### Why Use Performance Models or Tools?

- Identify performance bottlenecks
- Motivate software optimizations
- Determine when we're done optimizing
  - Assess performance relative to machine capabilities
  - Motivate need for algorithmic changes
- Predict performance on future machines / architectures
  - Sets realistic expectations on performance for future procurements
  - Used for HW/SW Co-Design to ensure future architectures are well-suited for the computational needs of today's applications.



#### **Computational Complexity**

- Assume run time is correlated with the number of operations (e.g. FP ops)
- Users define parameterize their algorithms, solvers, kernels
- Count the number of operations as a function of those parameters
- Demonstrate run time is correlated with those parameters





#### **Data Movement Complexity**

- Assume run time is correlated with the amount of data accessed (or moved)
- Easy to calculate amount of data accessed... count array accesses
- Data moved is more complex as it requires understanding cache behavior...
  - Compulsory<sup>1</sup> data movement (array sizes) is a good initial guess...
  - ... but needs refinement for the effects of finite cache capacities

|                          | I                  |          |  |  |  |  |
|--------------------------|--------------------|----------|--|--|--|--|
| Operation                | Flop's             | Data     |  |  |  |  |
| DAXPY                    | O(N)               | O(N)     |  |  |  |  |
| DGEMV                    | O(N <sup>2</sup> ) | $O(N^2)$ |  |  |  |  |
| DGEMM                    | O(N <sup>3</sup> ) | $O(N^2)$ |  |  |  |  |
| FFTs                     | O(NlogN)           | O(N)     |  |  |  |  |
| CG                       | O(N1.331           |          |  |  |  |  |
| MG                       |                    |          |  |  |  |  |
| N-body Which is more     |                    |          |  |  |  |  |
| expensive                |                    |          |  |  |  |  |
| Performing Flop's, or    |                    |          |  |  |  |  |
| Moving words from memory |                    |          |  |  |  |  |
|                          |                    |          |  |  |  |  |
|                          |                    |          |  |  |  |  |



#### **Machine Balance and Arithmetic Intensity**

- Data movement and computation can operate at different rates
- We define machine balance as the ratio of...

...and arithmetic intensity as the ratio of...





#### **Distributed Memory Performance Modeling**

- In distributed memory, one communicates by sending messages between processors.
- Messaging time can be constrained by several components...
  - Overhead (CPU time to send/receive a message)
  - Latency (time message is in the network; can be hidden)
  - Message throughput (rate at which one can send small messages... messages/second)
  - Bandwidth (rate one can send large messages... GBytes/s)
- Bandwidths and latencies are further constrained by the interplay of network architecture and contention
- Distributed memory versions of our algorithms can be differently stressed by these components depending on N and P (#processors)



#### **Computational Depth**

- Parallel machines incur substantial overheads on synchronization (shared memory), point-to-point communication, reductions, and broadcasts.
- We can classify algorithms by depth (max depth of the algorithm's dependency chain)
- ➤ If dependency chain crosses process boundaries, we incur substantial overheads.









### Modeling NUMA

#### **NUMA Effects**

- Cori's Haswell nodes are built from 2 Xeon processors (sockets)
  - Memory attached to each socket (fast)
  - Interconnect that allows remote memory access (slow == NUMA)
  - Improper memory allocation can result in more than a 2x performance penalty











## Hierarchical Roofline vs. Cache-Aware Roofline

...understanding different Roofline formulations in Advisor

#### There are two Major Roofline Formulations:

#### Hierarchical Roofline (original Roofline w/ DRAM, L3, L2, ...)...

- Williams, et al, "Roofline: An Insightful Visual Performance Model for Multicore Architectures", CACM, 2009
- Chapter 4 of "Auto-tuning Performance on Multicore Computers", 2008
- Defines multiple bandwidth ceilings and multiple Al's per kernel
- Performance bound is the minimum of flops and the memory intercepts (superposition of original, single-metric Rooflines)

#### Cache-Aware Roofline

- Ilic et al, "Cache-aware Roofline model: Upgrading the loft", IEEE Computer Architecture Letters, 2014
- Defines multiple bandwidth ceilings, but uses a single AI (flop:L1 bytes)
- As one looses cache locality (capacity, conflict, ...) performance falls from one BW ceiling to a lower one at constant Al

#### Why Does this matter?

- Some tools use the Hierarchical Roofline, some use cache-aware == Users need to understand the differences
- Cache-Aware Roofline model was integrated into production Intel Advisor
- Evaluation version of Hierarchical Roofline<sup>1</sup> (cache simulator) has also been integrated into Intel Advisor



#### **Hierarchical Roofline**

- Captures cache effects
- Al is Flop:Bytes after being filtered by lower cache levels
- Multiple Arithmetic Intensities (one per level of memory)
- Al dependent on problem size (capacity misses reduce Al)
- Memory/Cache/Locality effects are observed as decreased AI
- Requires performance counters or cache simulator to correctly measure Al

#### **Cache-Aware Roofline**

- Captures cache effects
- Al is Flop:Bytes as presented to the L1 cache (plus non-temporal stores)
- Single Arithmetic Intensity
- Al *independent* of problem size

- Memory/Cache/Locality effects are observed as decreased performance
- Requires static analysis or binary instrumentation to measure Al



#### **Example: STREAM**

- L1 Al...
  - 2 flops
  - 2 x 8B load (old)
  - 1 x 8B store (new)
  - = 0.08 flops per byte
- No cache reuse...
  - Iteration i doesn't touch any data associated with iteration i+delta for any delta.
- ... leads to a DRAM AI equal to the L1 AI

```
#pragma omp parallel for
for(i=0;i<N;i++){
    Z[i] = X[i] + alpha*Y[i];
}</pre>
```



#### **Example: STREAM**

#### **Hierarchical Roofline**

#### **Cache-Aware Roofline**







#### Example: 7-point Stencil (Small Problem)

#### L1 Al...

- 7 flops
- 7 x 8B load (old)
- 1 x 8B store (new)
- = 0.11 flops per byte
- some compilers may do register shuffles to reduce the number of loads.

#### Moderate cache reuse...

- old[ijk] is reused on subsequent iterations of i,j,k
- old[ijk-1] is reused on subsequent iterations of i.
- old[ijk-jStride] is reused on subsequent iterations of j.
- old[ijk-kStride] is reused on subsequent iterations of k.

#### ... leads to DRAM Al larger than the L1 Al



### Example: 7-point Stencil (Small Problem) Hierarchical Roofline Cache-Aware Roofline







### Example: 7-point Stencil (Small Problem) Hierarchical Roofline Cache-Aware Roofline







### Example: 7-point Stencil (Large Problem) Hierarchical Roofline Cache-Aware Roofline







### Example: 7-point Stencil (Observed Perf.) Hierarchical Roofline Cache-Aware Roofline







### Example: 7-point Stencil (Observed Perf.) Hierarchical Roofline Cache-Aware Roofline





