New Tools for Sharing Wealth of Data to Study Global Resources Issues
January 20, 2009
As they strive to develop effective strategies for guarding water supplies, protecting endangered species and curbing greenhouse gases, environmental scientists are turning to innovative cyber-infrastructures and data-mining tools developed by an ongoing collaboration between researchers at Lawrence Berkeley National Laboratory, Microsoft Research, and the University of California, Berkeley.
The Microsoft e-Science program is the primary funder of this project, which is one of numerous ventures cultivated by the Berkeley Water Center (BWC). Launched approximately three years ago by researchers from Berkeley Lab and UC Berkeley’s Colleges of Engineering and Natural Resources, the BWC marshals expertise from public institutions and the private sector in support of projects that enable science and public policy researchers to more easily access and work with water and environmental datasets.
“The most cost-efficient way to impact issues like global climate change and water management is to develop cyber-architectures that organize data and foster scientific collaboration,” says Susan Hubbard, staff scientist in Berkeley Lab’s Earth Sciences Division and associate director of the BWC.
Environmental scientists typically collect data on a project-by-project basis, in a series of campaigns targeted at very specific topics. One study may use NASA satellites to track annual rainfall in deserts around the globe, while another project sponsored by the National Science Foundation (NSF) might measure the annual water tables of the Sahara Desert with commercial sensors. The data are then typically stored in local archive systems and accessed by researchers associated with that particular project. These sites are scattered across the country, tend to be aligned with specific campaigns, and are funded by a variety of organizations.
According to Catharine van Ingen, partner architect with Microsoft Research, this system can be cumbersome because observations are stored in data archives and access centers in the same format in which they were deposited, undergoing only very simple checks and transformations, which makes the data difficult to share with other scientists. She notes that much of this information is not science-ready. To make it science-ready, the data must be cataloged, checked, and processed to eliminate obvious problems caused by battery loss, transcription errors, or environmental factors such as freezing rain or birds.
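The kind of screening van Ingen describes can be illustrated with a small sketch. The thresholds, sample values, and function name below are hypothetical; they stand in for whatever checks a given sensor network actually applies.

```python
# Hypothetical illustration of basic quality checks on raw sensor data:
# flag readings that fall outside a plausible physical range (a common
# symptom of battery loss or transcription errors) or that repeat
# verbatim for many intervals (a frozen or obstructed sensor).

def screen(readings, low, high, max_repeats=6):
    """Split raw (timestamp, value) pairs into clean and flagged lists."""
    clean, flagged = [], []
    run_value, run_length = None, 0
    for ts, value in readings:
        if value is None or not (low <= value <= high):
            flagged.append((ts, value, "out of range"))
            continue
        if value == run_value:
            run_length += 1
        else:
            run_value, run_length = value, 1
        if run_length >= max_repeats:
            flagged.append((ts, value, "stuck sensor"))
        else:
            clean.append((ts, value))
    return clean, flagged

# Hypothetical half-hourly temperature readings; -999.0 is a dropout code.
raw = [("00:00", 3.1), ("00:30", 3.2), ("01:00", -999.0), ("01:30", 3.2)]
clean, flagged = screen(raw, low=-40.0, high=60.0)
```

Real networks layer many more checks on top of this, but the shape is the same: each record either passes into the science-ready set or is flagged with a reason.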
In most cases, scientists also cannot withdraw data from these centers during non-business hours, so many researchers opt to retain their observations on their own desktop computers. If other researchers want to use this data, they have to contact the lead scientist and ask that the information be e-mailed to them.
“One of the greatest challenges of the next century will be developing cyber-architectures that allow scientists to easily navigate their digital assets. Today, the internet has given environmental researchers instant access to a wealth of field data. Now, they need a scientific ‘safety deposit box’ system that will not only store this information, but also organize it so it is searchable and ready for analysis,” says van Ingen.
Designing a Data-Sharing Platform for the Long Run
According to Deb Agarwal, head of CRD’s Advanced Computing for Science Department at Berkeley Lab and member of the BWC, the computing needs of many e-science researchers fall into the gap between the typical supercomputer user and the desktop computer user.
“An environmental dataset is often one terabyte or smaller in size; such datasets can be stored easily on a desktop hard drive. This means that the hardware needed to create a centralized database is extremely inexpensive and is not the limiting factor. Instead, usability and longevity of the data are the issues,” she says.
Agarwal’s team initially worked with existing Microsoft tools to develop a prototype database for data collected by the AmeriFlux network. For over 10 years, the AmeriFlux collaboration of field researchers has tracked the hourly exchange of carbon dioxide between plants and soil on the ground and the planet’s atmosphere at more than 120 sites across North, Central and South America. The sites represent a range of ecosystems, from Arctic tundra to North American prairies and Amazonian rainforests. Since its inception almost two years ago, the database has grown to include data from Fluxnet, which incorporates AmeriFlux’s counterparts in Asia, Africa, Australia, and Europe, including Siberia.
The Fluxdata Scientific Data Server now includes semi-automated ingest tools that extract important aspects of incoming data; a database and schema to organize and archive information; data cubes that allow researchers to look at the data from multiple perspectives; and tools that automatically convert multiple data versions into one format. The new architecture also enables researchers to browse data and reports via the internet and collaborate with each other. Scientists no longer need to download and interpret raw data from a data collection center; instead, they can browse, mine, and do research on the data without processing it first.
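The idea behind a data cube is that the same observations can be summarized along any combination of dimensions. A minimal sketch of that roll-up pattern, using entirely hypothetical site records and field names rather than the actual Fluxdata schema:

```python
# Sketch of the data-cube idea: aggregate one set of records along
# different dimension combinations. Records and field names are invented.
from collections import defaultdict

records = [
    {"site": "US-Ha1", "ecosystem": "forest",     "year": 2007, "co2_flux": -2.1},
    {"site": "US-Ha1", "ecosystem": "forest",     "year": 2008, "co2_flux": -1.8},
    {"site": "BR-Sa3", "ecosystem": "rainforest", "year": 2007, "co2_flux": -0.9},
]

def rollup(rows, dims, measure):
    """Sum a measure along any combination of dimensions."""
    totals = defaultdict(float)
    for row in rows:
        key = tuple(row[d] for d in dims)
        totals[key] += row[measure]
    return dict(totals)

# Two "perspectives" on the same underlying observations:
by_site = rollup(records, ["site"], "co2_flux")
by_eco_year = rollup(records, ["ecosystem", "year"], "co2_flux")
```

A production data cube precomputes and indexes these aggregates so that switching perspectives is a query, not a reprocessing job, which is what lets researchers explore without downloading the raw data.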
Once this server architecture proved successful, e-science team members applied this “cyber-blueprint” to create searchable central repositories for the variety of field data collected from California’s Russian and Pajaro Rivers. Currently, the team is collaborating with the National Marine Fisheries Service to aid research on fish recovery efforts in Northern California coastal streams, and will soon develop a server that encompasses observational information about all the watersheds in California.
“In the past, the computing needs of environmental researchers have often been overlooked because they are rarely on the leading edge of the computational or scale requirements of the scientific community, and collectively they are not a big enough customer to be commercially profitable. Despite this, their computing challenges are substantial, and solving them is essential to their work helping us understand climate change and our surrounding environment,” says Agarwal.
According to Jim Hunt, professor of civil engineering at UC Berkeley and co-director of the BWC, relatively basic questions, such as how the annual water balance in the Russian River watershed has changed over the past decade, were not impossible to answer before the e-Science data-mining tools were developed. However, gathering data from a variety of organizations, reformatting it for consistency, sifting out the important pieces of information, and calculating the balances was so time-consuming and tedious that most scientists didn’t want to tackle the issue. He notes that the new e-science tools can produce this answer in minutes. In addition, the data cube architecture allows scientists to find many different relationships in the datasets.
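Once the underlying data are consistent, the water-balance calculation Hunt mentions is essentially bookkeeping: precipitation in, minus evapotranspiration and streamflow out, leaves the change in storage. A sketch with hypothetical annual totals (the numbers below are illustrative, not Russian River measurements):

```python
# Simplified annual water balance for a watershed, in millimeters of
# water depth. The hard part in practice is assembling consistent
# precip/ET/runoff totals from many agencies, not this arithmetic.

def annual_balance(precip_mm, evapotrans_mm, runoff_mm):
    """Change in watershed storage for one year, in millimeters."""
    return precip_mm - evapotrans_mm - runoff_mm

# Hypothetical yearly totals for two water years:
years = {
    2006: annual_balance(precip_mm=950.0, evapotrans_mm=600.0, runoff_mm=320.0),
    2007: annual_balance(precip_mm=780.0, evapotrans_mm=590.0, runoff_mm=210.0),
}
```

With a central repository, each of those three inputs becomes a query against already-harmonized data, which is why a decade of balances can be produced in minutes rather than weeks.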
“Everything in an ecosystem is interconnected. Changes in one particular ecosystem could have global consequences, and tools like the data cube make it easier for us to see the big picture … we can now inquire about more complex relationships, such as how changes in a watershed’s annual water balance affect the amount of carbon dioxide in its surrounding atmosphere,” says Dennis Baldocchi, professor of Biometeorology at UC Berkeley.
“The answers to these types of questions will allow us to make accurate predictions about the future of such watersheds and, in turn, help us develop more effective strategies for managing these resources,” adds Hunt.