# Guiding **Optimization on KNL** with the **Roofline Model**





























## What is different about Cori?

### Edison ("Ivy Bridge"):

- 5576 nodes
- 24 physical cores per node
- 48 virtual cores per node
- 2.4-3.2 GHz
- 8 double precision ops/cycle
- 64 GB of DDR3 memory (2.5 GB per physical core)
- ~100 GB/s Memory Bandwidth

### **Cori ("Knights Landing"):**

- 9304 nodes
- 68 physical cores per node
- 272 virtual cores per node
- 1.4-1.6 GHz
- 32 double precision ops/cycle
- 16 GB of fast memory
- 96 GB of DDR4 memory
- Fast memory has 400-500 GB/s
- No L3 cache
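The two spec lists above can be turned into a rough per-node peak and machine balance. This is a minimal sketch, not a measurement: it assumes the lower (base) clock from each list, uses 450 GB/s as an illustrative midpoint of the 400-500 GB/s range, and the function names are hypothetical.

```python
# Rough per-node peak and machine balance from the spec lists above.
# Assumptions: base clock for each system, 450 GB/s as the midpoint of
# the quoted 400-500 GB/s fast-memory range.

def peak_gflops(cores, ghz, flops_per_cycle):
    """Theoretical double-precision peak per node in GFLOP/s."""
    return cores * ghz * flops_per_cycle

def machine_balance(peak, bandwidth_gbs):
    """FLOPs a code must perform per byte moved to reach the peak."""
    return peak / bandwidth_gbs

edison_peak = peak_gflops(cores=24, ghz=2.4, flops_per_cycle=8)
cori_peak = peak_gflops(cores=68, ghz=1.4, flops_per_cycle=32)

edison_balance = machine_balance(edison_peak, 100.0)  # ~100 GB/s DDR3
cori_balance = machine_balance(cori_peak, 450.0)      # assumed midpoint

print(f"Edison: {edison_peak:.1f} GFLOP/s, balance {edison_balance:.2f} FLOP/byte")
print(f"Cori:   {cori_peak:.1f} GFLOP/s, balance {cori_balance:.2f} FLOP/byte")
```

Under these assumptions the KNL node needs more FLOPs per byte than the Ivy Bridge node to reach its peak, even when the data sits in fast memory.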





# How can we enable NERSC's diverse community of 7,000 users, 750 projects, and 700 codes to run on advanced architectures like Cori and beyond?







# 1. Need a sense of absolute performance when optimizing applications.

- How do I know if my performance is good?
- How do I know when to stop?
- Why am I not getting peak performance, or the predicted ceiling for my application?

# 2. Many potential optimization directions:

- How do I know which to apply?
- What is the limiting factor in my app's performance?
- How do I know when to stop?







Optimizing code for Cori is like:

- A. A staircase?
- A. A labyrinth?
- A. A space elevator?





### *(More) Optimized Code*







### Are you memory or compute bound? Or both?



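If your performance changes when you run on only half of the cores per node (so each active core has access to more memory bandwidth), you are at least partially memory bandwidth bound.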









### Are you memory or compute bound? Or both?



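If your performance changes when you reduce the CPU clock speed (which slows computation but leaves memory bandwidth unchanged), you are at least partially compute bound.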
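The two experiments can be combined into a simple decision rule. The sketch below assumes three wall-clock times for the same problem: a baseline, a run using half of the cores per node (same task count spread over more nodes), and a run at reduced clock speed. The function name and the 5% tolerance are hypothetical choices, not part of the original material.

```python
def classify_bound(t_full, t_half_cores, t_half_clock, tolerance=0.05):
    """Classify a code from three wall-clock times in seconds.

    t_full:       all cores per node, full clock
    t_half_cores: half of the cores per node, full clock
    t_half_clock: all cores per node, reduced clock
    """
    def changed(t):
        # A run "changed" if it differs from the baseline by more than tolerance.
        return abs(t - t_full) / t_full > tolerance

    bandwidth_sensitive = changed(t_half_cores)
    compute_sensitive = changed(t_half_clock)

    if bandwidth_sensitive and compute_sensitive:
        return "both"
    if bandwidth_sensitive:
        return "memory bandwidth"
    if compute_sensitive:
        return "compute"
    return "neither"

# Only the half-cores run differs from the baseline here.
print(classify_bound(100.0, 70.0, 102.0))
```

A result of "neither" suggests looking elsewhere, for example at memory latency, I/O, or communication.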









# **Roofline helps visualize this information! Guides optimizations**
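In its basic form, a roofline bounds attainable performance by the smaller of the compute peak and the product of arithmetic intensity (FLOPs per byte) and memory bandwidth. The sketch below applies that bound using illustrative KNL numbers derived from the Cori spec list: 68 cores at an assumed 1.4 GHz with 32 FLOPs per cycle, and an assumed 450 GB/s for fast memory.

```python
def roofline(arithmetic_intensity, peak_gflops, bandwidth_gbs):
    """Attainable GFLOP/s for a kernel with the given FLOP/byte ratio."""
    return min(peak_gflops, arithmetic_intensity * bandwidth_gbs)

KNL_PEAK = 68 * 1.4 * 32   # GFLOP/s, assuming the 1.4 GHz base clock
KNL_HBM_BW = 450.0         # GB/s, assumed midpoint of 400-500 GB/s

# Arithmetic intensity at which the two ceilings meet (the "ridge point").
ridge = KNL_PEAK / KNL_HBM_BW

for ai in (0.1, 1.0, 10.0):
    bound = "memory" if ai < ridge else "compute"
    print(f"AI = {ai:5.1f} FLOP/byte -> {roofline(ai, KNL_PEAK, KNL_HBM_BW):7.1f} GFLOP/s ({bound} bound)")
```

Kernels to the left of the ridge point gain from better locality or faster memory; kernels to the right gain from better vectorization and threading.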

### **WARP Optimizations:**

- Add tiling over grid targeting L2 cache on both Xeon-Phi Systems
- Add particle sorting to further improve locality and memory access pattern
- Apply vectorization over particles















# **NESAP Example**

















Optimization process for Kernel-C (Sigma code):

1. Refactor (3 loops for MPI, OpenMP, vectors)
2. Add OpenMP
3. Initial vectorization (loop reordering, conditional removal)
4. Cache blocking
5. Improved vectorization (divides)
6. Hyper-threading









### **Optimization Process**

Systems compared: Haswell, KNL (DDR), KNL (HBM).



### Vectorization





- `ngpown` is typically in the 100s to 1000s. Good for many threads.
- `ncouls` is typically in the 1000s to 10,000s.
- Original inner loop: an attempt to save work breaks vectorization.
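The original loop is not reproduced here, so the following is only an illustration of the general pattern. It assumes the work-saving attempt is a data-dependent skip inside the inner loop, and contrasts it with a form that does the same arithmetic on every iteration and selects the result with a mask. The function names, the threshold, and the `1/v` workload are hypothetical. Python does not vectorize either loop; the point is the loop structure a vectorizing compiler would see.

```python
def sum_with_skip(values, threshold):
    """Skips small entries early: less work, but data-dependent control flow."""
    total = 0.0
    for v in values:
        if abs(v) < threshold:
            continue
        total += 1.0 / v
    return total

def sum_with_mask(values, threshold):
    """Same result with uniform work per iteration and a mask on the result."""
    total = 0.0
    for v in values:
        keep = abs(v) >= threshold      # mask: True (1) or False (0)
        safe = v if keep else 1.0       # select a safe operand, like a vector blend
        total += keep * (1.0 / safe)    # masked-out lanes contribute 0.0
    return total

data = [2.0, 0.0, 4.0, 1e-12, -0.5]
print(sum_with_skip(data, 1e-6), sum_with_mask(data, 1e-6))
```

The masked form performs a few redundant divides, which is often a good trade when it lets the whole loop run on the vector units.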



### Change in Roofline



The loss of L3 on MIC makes locality more important.





### **KNL Roofline Optimization Path**



| Loop nest (pseudocode)                                   | Cache notes                                   |
|----------------------------------------------------------|-----------------------------------------------|
| `!$OMP DO`                                               |                                               |
| `do my_igp = 1, ngpown`                                  | Required cache size to reuse 3 times: 1536 KB |
| `do iw = 1, 3`                                           |                                               |
| `do ig = 1, igmax`                                       | L2 on KNL is 512 KB per core                  |
| `load wtilde_array(ig,my_igp)` (819 MB, 512 KB per row)  | L2 on Haswell is 256 KB per core              |
| `load aqsntemp(ig,n1)` (256 MB, 512 KB per row)          | L3 on Haswell is 3800 KB per core             |
| `load I_eps_array(ig,my_igp)` (819 MB, 512 KB per row)   |                                               |
| `do work (including divide)`                             |                                               |

Without blocking we spill out of L2 on KNL and Haswell. But Haswell has L3 to catch us.



| Loop nest with cache blocking (pseudocode)               | Cache notes                                   |
|----------------------------------------------------------|-----------------------------------------------|
| `!$OMP DO`                                               |                                               |
| `do my_igp = 1, ngpown`                                  | Required cache size to reuse 3 times: 1536 KB |
| `do igbeg = 1, igmax, igblk`                             |                                               |
| `do iw = 1, 3`                                           |                                               |
| `do ig = igbeg, min(igbeg + igblk - 1, igmax)`           | L2 on KNL is 512 KB per core                  |
| `load wtilde_array(ig,my_igp)` (819 MB, 512 KB per row)  | L2 on Haswell is 256 KB per core              |
| `load aqsntemp(ig,n1)` (256 MB, 512 KB per row)          | L3 on Haswell is 3800 KB per core             |
| `load I_eps_array(ig,my_igp)` (819 MB, 512 KB per row)   |                                               |
| `do work (including divide)`                             |                                               |

Without blocking we spill out of L2 on KNL and Haswell. But Haswell has L3 to catch us. Blocking over `ig` keeps the three rows that are reused across the `iw` loop small enough to stay in L2.
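The cache arithmetic behind the blocking can be checked with a short calculation. This sketch assumes 16-byte elements (double-precision complex), which is not stated in the tables, and takes the 512 KB row size and 512 KB KNL L2 from them. The helper `block_ranges` is a hypothetical name that mirrors the `igbeg`/`igblk` loop bounds with 1-based inclusive indices.

```python
BYTES_PER_ELEMENT = 16      # assumption: double-precision complex
ROW_BYTES = 512 * 1024      # "512 KB per row" from the table
ARRAYS = 3                  # wtilde_array, aqsntemp, I_eps_array
L2_BYTES = 512 * 1024       # KNL L2 per core, from the table

igmax = ROW_BYTES // BYTES_PER_ELEMENT                 # elements per row
working_set_kb = ARRAYS * ROW_BYTES // 1024            # unblocked working set
max_igblk = L2_BYTES // (ARRAYS * BYTES_PER_ELEMENT)   # largest block that fits in L2

def block_ranges(igmax, igblk):
    """Yield 1-based inclusive (igbeg, igend) pairs that tile 1..igmax exactly once."""
    for igbeg in range(1, igmax + 1, igblk):
        yield igbeg, min(igbeg + igblk - 1, igmax)

print(f"igmax = {igmax}, unblocked working set = {working_set_kb} KB")
print(f"largest igblk fitting in L2 = {max_igblk}")
print(list(block_ranges(10, 4)))
```

In practice a block somewhat smaller than `max_igblk` leaves room in L2 for other data; the right value is best found by measurement.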



### **Cache Blocking Optimization**



### KNL Roofline Optimization Path









### Cache Blocking Optimization (Hierarchical Roofline)

**Original Code** 

### Cache-Blocking Code









### Found significant x87 instructions from 1/complex\_number instead of AVX/AVX-512

Figure: Intel VTune Amplifier XE 2015, Advanced Hotspots, source and assembly view. The hot source line 480, `cden = 1 / cden`, maps to x87 instructions (`fld`, `fmul`, `faddp`, `fdivr`, `fxch`) interleaved with AVX moves, rather than to AVX vector arithmetic.

Source lines visible in the screenshot (inside `do ig = igbeg, min(igend,igmax)`):

- `wdiff = wxt - wtilde_array(ig,my_igp)`
- `cden = wdiff`
- Commented out: `rden = cden * CONJG(cden)`, `rden = 1D0 / rden`, `delw = wtilde_array(ig,my_igp) * CONJG(cden) * rden`
- `cden = 1 / cden` (line 480, the hotspot)
- `delw = wtilde_array(ig,my_igp) * cden`
- `delwr = delw*CONJG(delw)`
- `wdiffr = wdiff*CONJG(wdiff)`
- Comment: "JRD: Complex division is hard to vectorize. So, we help the compiler."
- `scha(ig) = mygpvar1 * aqsntemp(ig,n1) * delw * I_eps_array(ig,` (truncated in the screenshot)
- Comment: "JRD: This if is OK for vectorization"
- `if (wdiffr.gt.limittwo .and. delwr.lt.limitone) then scht = scht + scha(ig)`
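The commented-out lines in the screenshot (`rden = cden * CONJG(cden)`, `rden = 1D0 / rden`, then a multiply by `CONJG(cden) * rden`) point at a manual rewrite of the complex reciprocal as one real divide plus multiplies. The sketch below checks that identity numerically; it illustrates the algebra only, and the function name is hypothetical.

```python
def reciprocal_via_conjugate(cden):
    """Compute 1/cden as conj(cden) / |cden|^2 using a single real divide."""
    rden = (cden * cden.conjugate()).real   # |cden|^2 is purely real
    rden = 1.0 / rden                       # the only division
    return cden.conjugate() * rden

z = complex(3.0, 4.0)
direct = 1 / z
manual = reciprocal_via_conjugate(z)
print(direct, manual)
```

The rewritten form does not guard against overflow or underflow in `|cden|^2`, which is one reason compilers are cautious with complex division by default.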

### Can significantly speed up by using

```
-fp-model fast=2
```



### Additional Speedups from Hyperthreading



























### BerkeleyGW Use Case

- Big systems require more memory. Cost scales as $N_{\rm atoms}^2$ to store the data.
- In an MPI GW implementation, in practice, data is duplicated to avoid communication, and each MPI task has a memory overhead.
- Users are sometimes forced to use 1 of 24 available cores in order to provide MPI tasks with enough memory. More than 90% of the computing capability is lost.
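The effect of per-task memory overhead on usable cores is simple arithmetic. The sketch below uses the Edison node from the spec list (64 GB, 24 cores); the 40 GB per-task footprint is a hypothetical value chosen to reproduce the "1 of 24 cores" case, and the function name is illustrative.

```python
def usable_cores(node_memory_gb, cores_per_node, memory_per_task_gb):
    """Return (MPI tasks that fit on a node, fraction of cores left idle).

    Assumes one single-threaded MPI task per core.
    """
    tasks = min(cores_per_node, int(node_memory_gb // memory_per_task_gb))
    idle_fraction = 1.0 - tasks / cores_per_node
    return tasks, idle_fraction

tasks, idle = usable_cores(64.0, 24, 40.0)   # hypothetical 40 GB per task
print(f"{tasks} task(s) per node, {idle:.0%} of cores idle")
```

Sharing the duplicated data among threads within one task, for example with OpenMP, is what lets the idle cores do useful work again.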









In-house code (I'm one of the main developers). Used as a "prototype" for application readiness.

A significant bottleneck is large matrix-reduction-like operations: turning arrays into numbers.

$$\begin{aligned}
\langle n\mathbf{k} | \Sigma_{\rm CH}(E) | n'\mathbf{k} \rangle = {} & \frac{1}{2} \sum_{n''} \sum_{\mathbf{q} \mathbf{G} \mathbf{G}'} M_{n''n}^*(\mathbf{k}, -\mathbf{q}, -\mathbf{G}) M_{n''n'}(\mathbf{k}, -\mathbf{q}, -\mathbf{G}') \\
& \times \frac{\Omega_{\mathbf{G} \mathbf{G}'}^2(\mathbf{q}) \left(1 - i \tan \phi_{\mathbf{G} \mathbf{G}'}(\mathbf{q})\right)}{\tilde{\omega}_{\mathbf{G} \mathbf{G}'}(\mathbf{q}) \left(E - E_{n''\mathbf{k} - \mathbf{q}} - \tilde{\omega}_{\mathbf{G} \mathbf{G}'}(\mathbf{q})\right)} v(\mathbf{q} + \mathbf{G}')
\end{aligned}$$









### So, you are Memory Bandwidth Bound?



1. Identify the key arrays leading to high memory bandwidth usage and make sure they are, or will be, allocated in HBM on Knights Landing.

   Profit by getting ~4-5x more bandwidth.
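Deciding which arrays go into HBM is a capacity question first. The sketch below places arrays into a 16 GB budget in priority order, using the array sizes quoted in the cache-blocking tables. Treating those sizes as the per-node footprint, and the ordering by bandwidth importance, are assumptions; the function name is hypothetical.

```python
def place_in_hbm(arrays_mb, hbm_capacity_mb):
    """Greedily place arrays into HBM.

    arrays_mb: list of (name, size_mb), ordered from most to least
               bandwidth-critical.
    Returns (names placed in HBM, MB used).
    """
    placed, used = [], 0
    for name, size in arrays_mb:
        if used + size <= hbm_capacity_mb:
            placed.append(name)
            used += size
    return placed, used

# Sizes from the loop tables; the priority order is an assumption.
arrays = [("wtilde_array", 819), ("I_eps_array", 819), ("aqsntemp", 256)]

placed, used = place_in_hbm(arrays, 16 * 1024)
print(placed, f"{used} MB of {16 * 1024} MB")
```

When everything fits, as here, the simplest option is to run the whole application out of HBM; the selection only matters once the footprint exceeds 16 GB.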







# What to do?

1. Make sure you have good OpenMP scalability. Look at VTune to see thread activity for major OpenMP regions.

2. Make sure your code is vectorizing. Look at cycles per instruction (CPI) and VPU utilization in VTune.

   Check whether the Intel compiler vectorized a loop using the compiler flag `-qopt-report=5`.









You may be memory latency bound (or you may be spending all your time in I/O and communication).

If running with hyper-threading improves performance, you *might* be latency bound.

If you can, try to reduce the number of memory requests per flop by accessing contiguous and predictable segments of memory and reusing variables in cache as much as possible.

On Knights Landing, each core supports up to 4 hardware threads. Use them all.







### Are you memory or compute bound? Or both?










Cray XC40 system with 9,600+ Intel Knights Landing (KNL) nodes:

- 68 cores, 272 hardware threads
- Up to 32 FLOPs per cycle, 1.2-1.4 GHz clock rate
- Wide (512-bit) vector units
- Multiple memory tiers: 96 GB DRAM / 16 GB HBM
- NVRAM burst buffer: 1.5 PB, 1.5 TB/s
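The "32 FLOPs per cycle" figure follows from the vector width. The breakdown below assumes two vector units per core and counts a fused multiply-add as two FLOPs; neither detail is stated in the list above, so treat both as assumptions. It also derives the thread count and the peak range for the quoted clock rates.

```python
VECTOR_BITS = 512       # from the list above
DOUBLE_BITS = 64
VPUS_PER_CORE = 2       # assumption: two vector units per KNL core
FLOPS_PER_FMA = 2       # assumption: one multiply plus one add per lane

CORES = 68
THREADS_PER_CORE = 4

lanes = VECTOR_BITS // DOUBLE_BITS                        # doubles per vector
flops_per_cycle = lanes * VPUS_PER_CORE * FLOPS_PER_FMA   # per core
hardware_threads = CORES * THREADS_PER_CORE

# Node peak in GFLOP/s across the quoted 1.2-1.4 GHz clock range.
peak_low = CORES * flops_per_cycle * 1.2
peak_high = CORES * flops_per_cycle * 1.4

print(f"{lanes} lanes, {flops_per_cycle} FLOPs/cycle/core, {hardware_threads} threads")
print(f"peak: {peak_low:.1f}-{peak_high:.1f} GFLOP/s per node")
```

Reaching this peak requires code that is both fully vectorized and dominated by fused multiply-adds, which is why vectorization appears so often in the optimization steps.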



