Arie Shoshani Retiring After 38 Years Managing Big and Bigger Data
June 5, 2014
By Jon Bashor
When he was fresh out of Princeton University with a Ph.D. in electrical engineering, Arie Shoshani passed up jobs in Phoenix (too hot) and Pasadena (too smoggy), instead joining a software company spun off from the RAND Corp. in Santa Monica. In 1969, System Development Corp. was one of the first nodes on ARPAnet and was using IMPs, or Interface Message Processors, to move packets of data between nodes.
Shoshani recalls how he was intrigued by the idea of data and distributed databases. “I thought, ‘I better look into this area of data movement. It can only get bigger.’ I was thinking about job security,” Shoshani said, while looking back on his 38 years at Berkeley Lab.
Since that early foray into data transfers, Shoshani has become an internationally known expert in scientific data management, both leading the Scientific Data Management Group at Berkeley Lab and serving as the principal investigator for three consecutive projects under DOE’s Scientific Discovery through Advanced Computing (SciDAC) program.
At Princeton, Shoshani concentrated on operating systems, specifically on “deadlocks,” which would cause an OS to crash. He noticed that the same problem could occur when moving data, particularly when there was no space to store the data at the intended destination. He realized it was really an issue with resource management. So, he wrote a paper about it, the first of many.
When System Development Corp. decided to become a for-profit company and take on commercial clients, Shoshani started looking around for a new position. He figured that Berkeley Lab would have some interesting data challenges and came to the lab to give a talk on distributed databases. Carl Quong, then head of the Computer Science and Mathematics Division, offered him a job and Shoshani moved to Berkeley.
When he arrived, the department was working on a contract project for the U.S. Department of Labor using socio-economic data, such as census figures, trade patterns, etc., with all the data stored in books. They would then produce maps to represent the data. “My job was to figure out what to do with the data – the people were great and knew about the data in detail. We came up with statistical databases in order to get more insight,” Shoshani said.
About the same time, Quong encouraged Shoshani to hold a workshop on scientific and statistical database management. The workshop evolved into a conference, now known as the Scientific and Statistical Database Management Conference, and has been held for 37 consecutive years. Due to his upcoming retirement, Shoshani will miss the 38th meeting, the first one he hasn’t attended.
After a few years at the lab, Shoshani noticed that other scientific disciplines, such as high-energy physics, were not using databases, but were using their own formats and models for processing data. The situation led to a proposal for a scientific data management program at Berkeley Lab, which DOE initially funded for $500,000 and which has since become the base program that supports Shoshani’s group to this day.
Ahead of the Big Data Curve
As Shoshani sees it, he and his group have always been ahead of the curve when it comes to Big Data. In the area of high-energy physics, for example, experiments at colliders produce far more collisions, or events, than the scientists can get out of the system, leading to much more data than they can handle. Using hardware, they can reduce the millions of collisions to measure only the most energetic particles. But they still end up with megabytes of data on discs and tapes, which used to take months to sift through. One of DOE’s “Grand Challenges” in the 1990s was to be able to search large amounts of data much more efficiently.
Shoshani’s group tackled the challenge, which led to two key developments. One was the Storage Resource Manager, a tool for managing and moving cached datasets stored on tapes. This capability became even more important with the emergence of the Grid and distributed computing. The idea led to the development of two large-scale scientific grids that still support international research – the Earth System Grid for climate research and the Open Science Grid collaboration supporting research in high-energy physics, nanoscience and structural biology.
The second product of the Grand Challenge work was FastBit, an indexing method for large datasets that can perform searches more than 10 times faster than some of the leading commercial tools. Originally developed to help scientists find critical events in data pouring out of the STAR experiment at the Relativistic Heavy Ion Collider at Brookhaven National Laboratory, FastBit was recognized with a 2008 R&D 100 Award. John Wu in Shoshani’s group was the lead developer.
In 2000, DOE launched the SciDAC program and then-NERSC Director Horst Simon encouraged Shoshani to submit a proposal for a scientific data management center. The proposal for the Scientific Data Management Integrated Software Infrastructure Center, a collaboration with Argonne, Lawrence Livermore, Pacific Northwest and Oak Ridge national labs, was funded for five years with Shoshani as the principal investigator. The purpose was not only to achieve efficient storage of and access to the data using specialized indexing, compression and parallel storage and access technology, but also to make more effective use of scientists’ time by providing specialized data-mining techniques, streamlining time-consuming tasks and automating scientists’ workflows.
When the SciDAC-2 program was launched, Shoshani was again selected to lead the Data Management Center for Enabling Technologies, another five-year partnership among five national labs and five universities. And in 2012, as Shoshani was contemplating retirement, he was chosen to lead SDAV, the Scalable Data Management, Analysis and Visualization Institute under the SciDAC-3 program.
“I was planning to retire, but was told ‘Arie, you should do it,’ so I said OK, I’ll stay another few years,” Shoshani said. “Now it’s time for me to take it easier, do a little more exercising, and take care of my wife and myself.”
Well-Deserved Recognition
In August 2013, Shoshani’s contributions to the scientific community were recognized with the Berkeley Lab Prize for Lifetime Scientific Achievement. Berkeley Lab Director Paul Alivisatos presented the award in front of an enthusiastic crowd filling the Building 50 auditorium. Alivisatos noted that the recently published book, The Fourth Paradigm: Data-Intensive Scientific Discovery, “caught up with where Arie was 25 years ago – well before the idea had currency and decades before others.” Shoshani laid the early foundations for understanding the massive datasets that are increasingly a hallmark of research, Alivisatos said.
“Arie was really out there early on, became a leader and has left his indelible mark on the field,” Alivisatos said.
Shoshani’s work with information-gathering instruments began when he joined the Israeli army after high school. After a number of live-ammunition exercises, he realized the infantry was not for him and volunteered to become a radar engineer in the air force. The radar systems were critical to Israel’s defense, so when one broke down, Shoshani went to work. Even if he was on vacation, the air force would send a special truck to bring him back to the base. One time, after Shoshani completed a repair at 3 a.m., the grateful base commander roused the head cook and ordered him to cook any dish Shoshani desired. The gesture made the 20-year-old feel very important.
Shoshani continued to work part time for the air force while he attended Technion, Israel’s best technical school, where he earned a degree in electrical engineering. Determined to attend grad school in the U.S., Shoshani was accepted at both Princeton and UC Berkeley, but Princeton won out with more generous financial aid. There, he rubbed shoulders with the likes of Peter Denning and Jeff Ullman. Eventually, though, he did make it to Berkeley.
“One of the great things about the lab is that you get to talk to people from different scientific areas, and to understand what is important for them,” he said. “This opens up new areas for research, which makes the work here even more interesting.”
About Berkeley Lab
Founded in 1931 on the belief that the biggest scientific challenges are best addressed by teams, Lawrence Berkeley National Laboratory and its scientists have been recognized with 16 Nobel Prizes. Today, Berkeley Lab researchers develop sustainable energy and environmental solutions, create useful new materials, advance the frontiers of computing, and probe the mysteries of life, matter, and the universe. Scientists from around the world rely on the Lab’s facilities for their own discovery science. Berkeley Lab is a multiprogram national laboratory, managed by the University of California for the U.S. Department of Energy’s Office of Science.
DOE’s Office of Science is the single largest supporter of basic research in the physical sciences in the United States, and is working to address some of the most pressing challenges of our time. For more information, please visit energy.gov/science.