Quest for Speed Leads CRD’s Ibrahim to Accelerating Supercomputing Applications

Performance Tuning Expertise Leads to SC14 Recognition

January 23, 2015

by Jon Bashor


As a boy growing up in Egypt, Khaled Ibrahim was fascinated with learning about the things that were the fastest, strongest or biggest, whether it was a car, a horse or even a camel. His dream was to harness that speed.

“I thought I would enjoy riding a fast horse, or driving a car fast, but I never got much fun out of it and it was mostly a scary experience,” he says now. “HPC fulfilled my desire of having a fast ride, in my case racing to find a scientific answer.”

As a member of the Computer Languages and Systems Software Group in the Computational Research Division, Ibrahim specializes in performance tuning – going under the hood of supercomputers and figuring out how to increase their performance while running scientific applications.

“Performance tuning for HPC computation aims at having the shortest time to an answer for a question of interest in science,” Ibrahim said. “This typically translates to increasing the efficiency to perform a computation or making an answer feasible in the first place.”

At the SC14 conference held in November 2014, Ibrahim’s tuning expertise helped him win the HPC Challenge for the fastest performance of a Fast Fourier Transform (FFT) application. He tuned his application to achieve 226 teraflop/s running on “Mira,” the IBM Blue Gene/Q supercomputer at Argonne National Laboratory. His result was 9.7 percent faster than the runner-up, which ran on Japan’s K computer.

Although it was Ibrahim’s first time entering the HPC Challenge, his entry was an extension of the work he does at Berkeley Lab—tuning science applications and runtimes for better performance and working on advanced computer architectures.

Tuning applications to perform better helps researchers make better use of the time they are allocated on supercomputers, and more efficient applications mean supercomputing centers can deliver more science for the amount of electricity used. Applications that run slowly are often waiting for data to be communicated between processors and memory, meaning they are running without producing any results. Although these idle times are measured in fractions of a second, when an application is running on thousands of processors, the amount of unproductive time and energy can add up. And since supercomputers have different architectures and software, performance tuning is not one-size-fits-all.

“The challenge is that it does not only require understanding the computational algorithms but also underlying system capabilities. In other words, performance tuning requires more listening to the system before talking to it,” Ibrahim said. “By listening, I mean understanding the strengths and weaknesses of a system. By talking, I mean communicating the algorithm and its implementation.”

Ibrahim said that the communication between humans and computing systems crosses a wide spectrum of complexity. At one extreme is the Siri model, where the computing system tries to understand our natural language. “The other extreme, used in performance tuning for HPC, is trying to speak in a language that makes it easier for the computing system to understand our needs and perform them efficiently,” he said. “In an ideal world, we could combine both.”

According to Ibrahim, HPC systems are growing in complexity due to heterogeneity (using different types of processors) and deeper memory hierarchies (adding levels to exploit locality of data). This makes performance tuning dependent on acquiring skills in multiple areas, including single-core tuning, shared-memory programming, vector processing and distributed-memory programming.

As an example of the complexity of the problem, Ibrahim points to the Cray supercomputers named Hopper and Edison at the National Energy Research Scientific Computing Center (NERSC) at Berkeley Lab. Hopper, a Cray XE6, is designed to have just a few neighboring processors around each processor, so efficiently broadcasting data to many processors requires a tree-like pattern, with data branching out to an increasingly broad range of processors. But Edison’s architecture includes an interconnect with higher connectivity, allowing data to be sent directly to a large number of neighboring processors. Each architecture requires a different approach to performance tuning.
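The two broadcast patterns described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses a generic binomial tree, which is one common tree-like pattern and not necessarily what runs on Hopper, and the function names `tree_broadcast` and `direct_broadcast` are hypothetical. Each function returns the communication rounds as lists of (sender, receiver) pairs.

```python
def tree_broadcast(num_procs, root=0):
    """Binomial-tree schedule: the set of processors holding the data
    doubles each round, so about log2(num_procs) rounds are needed.
    (Illustrative only; not the actual algorithm used on any machine.)"""
    rounds = []
    have_data = 1  # processors (numbered relative to root) that hold the data
    while have_data < num_procs:
        sends = []
        for rel in range(have_data):
            target = rel + have_data
            if target < num_procs:
                sender = (rel + root) % num_procs
                receiver = (target + root) % num_procs
                sends.append((sender, receiver))
        rounds.append(sends)
        have_data *= 2
    return rounds


def direct_broadcast(num_procs, root=0):
    """Single round in which the root sends to every other processor."""
    return [[(root, r) for r in range(num_procs) if r != root]]


if __name__ == "__main__":
    for name, schedule in (("tree", tree_broadcast(8)),
                           ("direct", direct_broadcast(8))):
        print(name, "rounds:", len(schedule))
        for i, sends in enumerate(schedule, start=1):
            print("  round", i, sends)
```

The tree spreads the sending work across many processors at the cost of more rounds, while the direct pattern finishes in one round but makes the root send every message itself. Which one is faster depends on how many neighbors the network lets a processor reach efficiently, which is the tradeoff the two Cray systems illustrate.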

“It’s not always clear which approach is the most efficient, which will help reaching an answer using the least amount of energy or time,” he said. “There’s usually a tradeoff between ease and efficiency. Generally, computing systems are kind enough to reveal their secrets if we give them enough listening time, attention, and ask them the right questions.”

This was especially true when Ibrahim ran his applications on Mira for the HPC Challenge. Only a small percentage of the FFT benchmark’s run time was spent on local computation on the processors; most of the time went to communicating data. Overall, the FFT took only seven seconds. Finding ways to tune the benchmark to reach that run time was critical.

Mira was designed using a system-on-a-chip architecture, meaning that most computation and communication steps are handled within the processor unit, rather than between processors and separate peripherals. Typically, small subsets of data are moved into the cache on the chip, then sent back to memory after processing. The communication phase follows, sending the data from memory through the network device. By tuning the application to communicate data while it’s still in the cache, the efficiency is increased by avoiding the time spent moving the data to and from main memory. But because the overall run time was so short and cache space is limited, this fusion of computation with communication by the application needed a low-overhead software stack.

Ibrahim said that this motivated him to move data using many small transfers, which in some ways is counter-intuitive. For other architectures, aggregating the data into larger transfers is the way to improve performance.
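A simple cost model can show why many small transfers help only when the per-message overhead is low. The sketch below is not Ibrahim’s method or a model of Mira; it assumes the textbook latency-bandwidth model, in which each message costs a fixed overhead `alpha` plus `beta` seconds per byte, and it assumes that a chunk’s transfer can overlap with computation on the next chunk. All names and numbers are hypothetical.

```python
def bulk_time(compute, nbytes, alpha, beta):
    """Compute everything first, then send one aggregated message."""
    return compute + alpha + nbytes * beta


def pipelined_time(compute, nbytes, alpha, beta, chunks):
    """Split the work into equal chunks and send each chunk as soon as
    it is computed, overlapping that transfer with the next chunk's
    computation (a simple two-stage pipeline)."""
    comp = compute / chunks                    # compute time per chunk
    comm = alpha + (nbytes / chunks) * beta    # transfer time per chunk
    # First chunk's compute, then the slower stage dominates each of the
    # remaining steps, then the final chunk's transfer.
    return comp + (chunks - 1) * max(comp, comm) + comm


if __name__ == "__main__":
    compute = 1.0   # seconds of local computation (hypothetical)
    nbytes = 1e9    # bytes to communicate (hypothetical)
    beta = 1e-9     # seconds per byte (hypothetical)
    for alpha in (1e-6, 1e-2):  # low vs. high per-message overhead
        print("alpha =", alpha,
              "bulk =", bulk_time(compute, nbytes, alpha, beta),
              "pipelined =", pipelined_time(compute, nbytes, alpha, beta, 1000))
```

With the low overhead, the pipelined version hides most of the communication behind computation and finishes well ahead of the aggregated transfer. With the high overhead, the per-message cost is paid a thousand times and aggregation wins, which is the usual advice on other architectures.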

But getting to the heart of the problem can be difficult. Supercomputers run layers of inter-operating software known as the “software stack.” These stacks play the role of a translator between the system and the user.

“But you need to take special care because such translation may blur the true image of the raw capabilities of a system,” Ibrahim said. “Dealing with that may require bypassing such stacks, which requires time, patience and dedication. But I can afford that, if I will be rewarded with the fast ride at the end.”

And unlike a headstrong horse or a stubborn person, computing systems aren’t biased -- if you give them the right instructions, they execute them precisely and efficiently, Ibrahim said.

About Berkeley Lab

Founded in 1931 on the belief that the biggest scientific challenges are best addressed by teams, Lawrence Berkeley National Laboratory and its scientists have been recognized with 16 Nobel Prizes. Today, Berkeley Lab researchers develop sustainable energy and environmental solutions, create useful new materials, advance the frontiers of computing, and probe the mysteries of life, matter, and the universe. Scientists from around the world rely on the Lab’s facilities for their own discovery science. Berkeley Lab is a multiprogram national laboratory, managed by the University of California for the U.S. Department of Energy’s Office of Science.

DOE’s Office of Science is the single largest supporter of basic research in the physical sciences in the United States, and is working to address some of the most pressing challenges of our time. For more information, please visit energy.gov/science.