
Sifting Through a Trillion Electrons


SDM's Surendra Byna and colleagues from Berkeley Lab's Computational Research Division teamed up with researchers to develop novel software strategies for storing, mining, and analyzing the massive datasets generated by VPIC, a state-of-the-art plasma physics code.

Catching Turbulence in the Solar Wind


Massive datasets, combined with modeling, visualization, and analytics, allow researchers to "see" the unseen: turbulence in the solar wind.

Arie Shoshani Earns Lifetime Achievement Award


More than 25 years ago, Arie Shoshani realized that researchers faced significant challenges in organizing, managing, and analyzing their scientific data. He set out to develop computer applications to help them meet those challenges, creating the Scientific Data Management Group in the process.

The Scientific Data Management (SDM) group develops technologies and tools for efficient data access and storage management of massive scientific datasets. We are currently developing storage resource management tools, data querying technologies, in situ feature extraction algorithms, and software platforms for exascale data. The group also works closely with application scientists to address their data processing challenges. These tool and application development activities are backed by active research on novel algorithms for emerging hardware platforms.

Group Leader: John Wu

» Visit the Scientific Data Management (SDM) site.

Current Projects

ICEE: International Collaboration Framework for Extreme Scale Experiments

Large-scale scientific explorations in domains such as high-energy physics, fusion, and climate are based on international collaborations. As these collaborations produce ever more data, existing workflow management systems are hard pressed to keep pace. A necessary solution is to process, analyze, summarize, and reduce the data before it reaches the relatively slow disk storage system, an approach known as in transit processing (or in-flight analysis). We propose to dramatically increase the data handling capability of collaborative workflow systems by leveraging the popular in transit processing system ADIOS and integrating it with FastBit to provide selective data access. These new features will contribute to a new collaborative system named ICEE that aims to significantly improve data flow management for distributed workflows.
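The in transit idea can be sketched in a few lines: each data chunk is filtered and summarized while still in flight, so only reduced results ever reach slow disk storage. The names and record layout below are illustrative assumptions, not ICEE's or ADIOS's actual API.

```python
# Toy sketch of in transit (in-flight) processing: select records of
# interest from each streaming chunk, then persist only a summary.

def in_transit_pipeline(chunks, predicate, summarize):
    """Filter each chunk as it streams by, yielding only its summary."""
    for chunk in chunks:
        selected = [rec for rec in chunk if predicate(rec)]   # selective access
        yield summarize(selected)                             # data reduction

# Example stream: two chunks of (particle_id, energy) records.
stream = [[(1, 0.2), (2, 5.1), (3, 7.4)],
          [(4, 0.9), (5, 6.3)]]

summaries = list(in_transit_pipeline(
    stream,
    predicate=lambda rec: rec[1] > 1.0,   # keep high-energy particles only
    summarize=lambda rs: {"count": len(rs),
                          "mean_e": sum(r[1] for r in rs) / len(rs) if rs else 0.0}))
print(summaries)  # per-chunk summaries, not the raw records
```

In a real deployment the predicate and summary would run inside the I/O pipeline (e.g., an ADIOS staging area) rather than in the application itself; the point is that the full-resolution chunks are never written to disk.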

FastBit: An Efficient Compressed Bitmap Index Technology

FastBit is an open-source data processing library providing search functions supported by compressed bitmap indexes. It treats user data in a column-oriented manner similar to well-known database management systems such as Sybase IQ, MonetDB, and Vertica. The key technology underlying the FastBit software is a set of compressed bitmap indexes. In database systems, an index is a data structure that accelerates data access and reduces query response time. Most commonly used indexes are variants of the B-tree, such as the B+-tree and B*-tree. FastBit implements an alternative: compressed bitmap indexes. Compared with B-tree variants, these indexes provide very efficient search and retrieval operations, but are somewhat slower to update when individual records are modified.
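A minimal sketch of the bitmap-index idea may help: one bitmap per distinct column value, with queries answered by bitwise operations over those bitmaps. This is not FastBit's API, and it omits the compression step (FastBit uses Word-Aligned Hybrid encoding to keep the bitmaps small); it only illustrates why equality and membership queries reduce to fast AND/OR operations.

```python
# Toy equality-encoded bitmap index: one bitset per distinct value,
# stored here as a Python int. Queries are bitwise ORs over bitsets.

from collections import defaultdict

class BitmapIndex:
    """Bitmap index over a single column of values."""

    def __init__(self, column):
        self.nrows = len(column)
        self.bitmaps = defaultdict(int)        # value -> bitset
        for row, value in enumerate(column):
            self.bitmaps[value] |= 1 << row    # set this row's bit

    def equals(self, value):
        """Row numbers where column == value."""
        return self._rows(self.bitmaps.get(value, 0))

    def isin(self, values):
        """Row numbers where column is any of `values` (OR of bitmaps)."""
        bits = 0
        for v in values:
            bits |= self.bitmaps.get(v, 0)
        return self._rows(bits)

    def _rows(self, bits):
        return [r for r in range(self.nrows) if bits >> r & 1]

energy_bin = ["low", "high", "mid", "high", "low", "mid"]
idx = BitmapIndex(energy_bin)
print(idx.equals("high"))        # → [1, 3]
print(idx.isin({"low", "mid"}))  # → [0, 2, 4, 5]
```

The trade-off mentioned above is visible here: answering a query is a handful of bitwise operations, but updating one record means touching the bitmap of its old value and its new value, which is costlier than a single B-tree update.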

SDAV: SciDAC Scalable Data Management, Analysis, and Visualization Institute

The SciDAC SDAV Institute actively works with application teams to help them achieve breakthrough science, providing technical solutions in data management, analysis, and visualization that are broadly applicable across the computational science community. As the scale of computation has exploded, the data produced by simulations has grown in size, complexity, and richness by orders of magnitude, and this trend will continue. Users of scientific computing systems face the daunting task of managing and analyzing their datasets for knowledge discovery, frequently with antiquated tools more appropriate for the teraflop era. While new techniques and tools exist to address these challenges, application scientists are often unaware of them, unfamiliar with their use, or unable to find them installed at the appropriate facilities. SDAV will deploy, and assist scientists in using, technical solutions addressing challenges in three areas:

• Data Management – infrastructure that captures the data models used in science codes; efficiently moves, indexes, and compresses this data; enables queries over scientific datasets; and provides the underpinnings of in situ data analysis

• Data Analysis – application-driven, architecture-aware techniques for performing in situ data analysis, filtering, and reduction to optimize downstream I/O and prepare for in-depth post-processing analysis and visualization

• Data Visualization – exploratory visualization techniques that support understanding ensembles of results, methods of quantifying uncertainty, and identifying and understanding features in multi-scale, multi-physics datasets

ADAPT: Adaptive Data access and Policy-driven Transfers

Large-scale science applications are expected to generate exabytes of data over the next 5 to 10 years. With scientific data collected at unprecedented volumes and rates, the success of large scientific collaborations will require distributed data access with lower latencies and higher reliability for a large user community. To meet these requirements, scientific collaborations are increasingly replicating large datasets over high-speed networks to multiple sites. The main objective of this work is to develop and deploy a general-purpose data access framework for scientific collaborations that provides lightweight performance monitoring and estimation, fine-grained and adaptive data transfer management, and enforcement of site and VO (virtual organization) policies for resource sharing. Lightweight mechanisms will collect monitoring information from data movement tools without putting extra load on the shared resources.
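One small piece of such a framework can be sketched concretely: given monitored throughput for each replica site and a per-site policy cap on concurrent transfers, pick the best eligible site. The function and data shapes below are hypothetical illustrations, not ADAPT's actual interface.

```python
# Illustrative replica selection: combine monitoring data (observed
# throughput) with policy enforcement (concurrent-transfer limits).

def choose_replica(throughput, active, policy_limit):
    """throughput: {site: observed MB/s};
    active: {site: transfers in flight};
    policy_limit: {site: max concurrent transfers allowed}.
    Returns the fastest site still under its policy cap, or None."""
    eligible = {site: mbps for site, mbps in throughput.items()
                if active.get(site, 0) < policy_limit.get(site, 1)}
    if not eligible:
        return None
    return max(eligible, key=eligible.get)

throughput = {"BNL": 850.0, "CERN": 1200.0, "NERSC": 950.0}
active = {"CERN": 4}                      # CERN already at its cap
limits = {"BNL": 4, "CERN": 4, "NERSC": 4}
print(choose_replica(throughput, active, limits))  # → NERSC
```

A real scheduler would also refresh the throughput estimates continuously and adapt mid-transfer, but the same two inputs (monitoring and policy) drive the decision.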

APM: Advanced Performance Modeling

To improve resource utilization and the scheduling of scientific data transfers on high-speed networks, we started the Advanced Performance Modeling with combined passive and active monitoring (APM) project, which is investigating and modeling a general-purpose, reusable, and extensible network performance estimation framework. The predictive estimation model and framework will help optimize the performance and utilization of fast networks, and enable resource sharing with predictable performance for scientific collaborations, especially in data-intensive applications.
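As a minimal sketch of the estimation idea, the snippet below folds a stream of throughput samples (whether gathered passively from transfer logs or actively from probes) into a single exponentially weighted moving-average prediction. APM's actual model is more sophisticated; the smoothing weight and class names here are assumptions for illustration.

```python
# Exponentially weighted moving average (EWMA) as a simple throughput
# predictor: newer samples count more, old history decays geometrically.

class ThroughputEstimator:
    def __init__(self, alpha=0.3):
        self.alpha = alpha       # weight given to the newest sample
        self.estimate = None     # current predicted throughput (MB/s)

    def update(self, sample_mbps):
        """Fold one passive or active measurement into the estimate."""
        if self.estimate is None:
            self.estimate = sample_mbps
        else:
            self.estimate = (self.alpha * sample_mbps
                             + (1 - self.alpha) * self.estimate)
        return self.estimate

est = ThroughputEstimator()
for sample in [900.0, 1100.0, 1000.0]:   # mixed monitoring samples, MB/s
    est.update(sample)
print(round(est.estimate, 1))  # → 972.0
```

Such an estimate is what a transfer scheduler would consult when deciding how many streams to open or which path to use.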


Computational Infrastructure for Financial Technologies

More from Current Projects »