2023 Hopper Fellow Embraces the Challenges of Large-Scale Science
November 21, 2023
By Keri Troutman
David Tench, the 2023 Grace Hopper postdoctoral fellow at Lawrence Berkeley National Laboratory (Berkeley Lab), describes discovering his fellowship opportunity as a “Goldilocks moment.”
After years in academia spent developing a very theoretical but powerful algorithm to tackle the analysis of enormous scientific datasets, he was looking for a place where he could match theory with real-world, large-scale science. He spent his time as a Ph.D. student at the University of Massachusetts Amherst developing algorithms to process massive, dynamically changing graphs. (Graphs, also called networks, are used to model connections between items, such as friendships among people, papers written by authors, and highways connecting cities.) The challenge was datasets that were so large they couldn’t be stored on a computer, and the algorithm needed to be able to determine patterns and results from the changing nature of the graph.
While this was fascinating work, it wasn’t being applied in real-world scientific settings. He hopes that his time working in Berkeley Lab’s Applied Mathematics and Computational Research Division’s Performance and Algorithms Research group moves his research to this next phase.
“Berkeley Lab’s mission statement is all about large-scale fundamental science for its own sake, and there is a lot of institutional prioritization of interdisciplinary work, which is crucial for my goals,” Tench says. “The next step is finding the real-world applications now that we've built the technology—and there are so many people here with large-scale computing needs.”
Not All Large-Scale Graphs are Sparse
Many current software systems are designed to study large networks that change over time, but they focus on graphs that are very sparse, meaning that each node in the graph doesn’t have many edges, he adds. “With too many edges, you run out of space.”
To give some context to this idea of a network expressed computationally as a graph, Tench points to social media platforms like Facebook—massive and dynamic social media connections are constantly being made and deleted on the order of billions of users. Call those users “nodes” and their many social media connections “edges,” and you’ve got a massive, but sparse, graph. The graph is sparse because most users have, on average, about 100 friends. For a social network graph to be dense, most people would have to be Barack Obama-level famous with millions of friends, so it makes sense that these kinds of graphs are sparse, Tench explains.
However, there are other kinds of graph datasets for which this sparse-only argument doesn't apply, he adds. For instance, graphs representing pairwise chemical interactions between proteins don't have to be sparse because for any particular protein, there's no obvious upper limit on the number of other proteins it may react with. “If you survey all the graph datasets published in academic literature, you'll find examples of small graphs that are sparse and small graphs that are dense, but the only large graphs to be found are sparse,” Tench says.
“This dovetails with a widely held belief in computer science about graphs as they exist in the real world—the belief that large-scale graphs are always sparse,” he explains.
“But I have an alternative potential explanation, which is that maybe this is an expression of the capabilities we currently have,” Tench says. “People are only publishing about and working on the things for which we already have tools that work really well.”
A Circuitous Route
This pondering of the “why” of the state of science comes from Tench’s beginnings as a theoretical computer scientist. His way into computer science was circuitous—Tench studied math as an undergrad and purposely avoided computer science until his senior year when a major requirement landed him in a complexity theory in computing sciences course. “It was the most fun I'd had in a technical class in college, and it happened to be in computer science.” Based on that experience alone, he applied to grad school and moved on to study computer science at UMass Amherst.
“I was fascinated by all the cool algorithmic techniques that existed,” he says. “So a lot of my research since then has been looking at these ideas and asking the question: what do we need to do to get to the point where people would actually use these techniques, and how would they help us do larger scale science?”
While there are a lot of complicated answers to that question, “one of the things that's really key is that you need to talk to a lot of scientists, find out what they care about, what they want to accomplish, and then try to help them come up with better ways to do those things,” he added. “That was what attracted me to Berkeley Lab in particular: there's this incredible interdisciplinary mix here, and it seemed like the kind of place where I could learn what it is these scientists need computationally and then figure out how we can take those ideas and use them to help people and further science.”
Linear Sketching of Algorithms
From his time as a CRA Computing Innovation Postdoctoral Fellow at Rutgers University in 2021, Tench has working prototypes that he’s ready to put to the test. These algorithms use a technique called linear sketching, and they serve two purposes. First, they are compact: they are able to sift through the huge volume of information about the graph and retain the bare minimum information required to answer questions about the graph's structure. This tiny but powerful data structure is called a sketch. Everything else can be thrown away, drastically reducing the memory requirement for the algorithm (and consequently allowing computation on larger graphs). Second, sketches are communication-efficient: they minimize the amount of communication required between different parts of the computer (CPU caches, RAM, and disk), which means they can be run efficiently even when the input graph is so enormous that its sketch is larger than available RAM.
“Graph algorithms are such a fundamental area of computer science that they tend to pop up all over the place, and there are in fact lots of problems in lots of areas that you could think of as a graph problem if you have the right capabilities for it,” says Tench. “Just recently here at the Lab, I was talking with a scientist who was challenged with processing particle accelerator data; tracing the trajectory of these particles could be thought of as a graph with a dynamic nature because we're looking at these sets of particles that are changing positions over time.”
One of the things researchers have recently discovered is that, in theory, the techniques that can be used to reduce the size of these graphs also yield algorithms that are very efficient when it comes to minimizing data movement, he adds. “We’re minimizing the amount of communication between the different parts of a system, which is often the thing that slows you down the most,” says Tench.
One challenge in applying the sketching algorithms Tench works with is that they have high computational cost; every time an edge in the graph changes, the algorithm must perform a computationally intensive calculation to compress that information into its tiny sketch. In this sense, the algorithms trade higher computational costs for reduced space and data movement costs.
“We have learned that this high computational cost is extremely parallel, and in prototype implementations, we have seen three orders of magnitude speedups on architectures such as consumer-grade clusters and GPUs,” Tench says. “I'm excited to see how much faster we can go using the massive parallelism available from the supercomputers at the National Energy Research Scientific Computing Center at Berkeley Lab.”
About Berkeley Lab
Founded in 1931 on the belief that the biggest scientific challenges are best addressed by teams, Lawrence Berkeley National Laboratory and its scientists have been recognized with 16 Nobel Prizes. Today, Berkeley Lab researchers develop sustainable energy and environmental solutions, create useful new materials, advance the frontiers of computing, and probe the mysteries of life, matter, and the universe. Scientists from around the world rely on the Lab’s facilities for their own discovery science. Berkeley Lab is a multiprogram national laboratory, managed by the University of California for the U.S. Department of Energy’s Office of Science.
DOE’s Office of Science is the single largest supporter of basic research in the physical sciences in the United States, and is working to address some of the most pressing challenges of our time. For more information, please visit energy.gov/science.