Berkeley Lab Technology for Speeding up Searches of Large Databases Wins R&D 100 Award
July 21, 2008
Contact: Jon Bashor, email@example.com, 510-486-5849510-486-5849
BERKELEY, Calif.—An indexing technology developed by researchers at Lawrence Berkeley National Laboratory which allows users to search massive datasets up to 40 times faster has been recognized with a 2008 R&D 100 award.
The award, presented by R&D Magazine, “provides a mark of excellence known to industry, government, and academia as proof that the product is one of the most innovative ideas of the year.” The awards will be presented at a special ceremony in Chicago in October.
FastBit was developed by Kesheng “John” Wu, Arie Shoshani, Ekow Otoo, and Kurt Stockinger of the Scientific Data Management Group in Berkeley Lab’s Computational Research Division. The project was supported by the Department of Energy’s Scientific Discovery through Advanced Computing program, which develops software for accelerating computational science.
“Since we began developing FastBit to help physicists search massive datasets resulting from large-scale experiments, we’ve seen an explosion of data in many other fields, too,” said John Wu, who led the project. “Since we released FastBit under an open-source license in 2007, it has attracted a lot of interest in new areas, such as network traffic data analysis, and business applications, drug discovery and web content delivery.”
Thanks to ubiquitous computers, smart phones and other electronic devices helping drive our fast-paced world, the amount of data being generated, collected and stored is growing exponentially. While this mountain of data contains lots of useful information, the ever-increasing size of these datasets makes it increasingly difficult to quickly find critical information. This problem poses a growing challenge in fields such as science, national security, medicine, finance and industry. One of the contributors to “big data” can be the search tools themselves, which actually create search indexes several times larger than the original collection. FastBit is an indexing method which can search huge databases 40 times faster than the fastest commercially available tools.
FastBit was originally designed to meet the needs of particle physicists who need to sort through billions of data records just to find 100 key pieces of information. For example, in a high-energy physics experiment called STAR, colliding particles generate billions of collision events, but only a few hundred of these events have the most distinctive signatures of the new state of matter called Quantum-Gluon Plasma (QGP). Finding evidence of QGP is a key objective of the STAR experiment.
But researchers in other fields have also found FastBit’s rapid search capability very useful. Cyber attacks known as “distributed denial of service attacks” can quickly bring down web servers using the network and an army of “drone” computers taken over for the sole purpose of launching a cyber attack. Finding out who controls these drones is key to identifying the people launching the attack. Sifting through the huge number of data records to find a relative small number of items that contain the nugget of insight is a daunting challenge.
This creates a great demand for accelerated search techniques known as indexing methods. Traditional indexing methods such as B-trees fail to meet such demanding data analysis needs. Bitmap indexes have been shown to be promising after a number of years of active use, but their effectiveness was limited to certain types of data, primarily those variables with a relative small number of possible values, such as the sex of a customer or the state in which the customer lives. FastBit is based on a new patented bitmap compression method that overcomes these barriers. It greatly improves the speed of searching operations, and expands the types of data that bitmap indexes can work with efficiently.
According to Wu, the most commonly used indexing methods, B-tree, and similar tree-based indexes, were designed for data that change rapidly over time. They are optimized for finding a small number of records, such as finding one’s bank account, and are not well suited for locating thousands records as it is common in analyses of large datasets. Hash-based indexing methods have similar shortcomings.
Additionally, B-tree indexes inflate the volume of the data by a factor of 3-4, which is not generally acceptable when dealing with terabyte-size datasets. Other tree-based indexing techniques are specialized to the task at hand; for example, Oct-tree and KD-tree indexes are designed for searching spatial data based on their spatial distribution. Such indexes are tailored for searching regions on a single variable, such as finding all hot regions in a climate database, but do not work well when multiple variables are searched together, such as finding hot regions where wind velocity is high.
The bitmap index is a class of indexing methods that accelerate searching operations on large datasets that do not change over time. They also work well for data that grow overtime, which is especially true in scientific applications. For this reason, most of the major database systems have implemented some variants of bitmap index in their products. However, their effectiveness is limited to data variables whose domain has a relatively small number of possible values, referred to as low cardinality variables. Users are advised to not use them on high cardinality variables such as velocity, temperature and pressure values produced from a scientific simulation.
FastBit significantly improves the speed of searching operation on both high and low cardinality values with a number of techniques including a vertical data organization, an innovative bitmap compression technique, and several new bitmap encoding methods.
Key to FastBit’s success is an innovative bitmap compression method called the Word-Aligned Hybrid (WAH) compression. Typically, bitmap indexes are too large to be stored in computer memory, therefore to answer a query using an index, the relevant part of the index has to be read into the computer’s memory before the necessary computation can be performed. The conventional wisdom is that the time needed to read parts of the index is much greater than the time needed to perform the actual query. However, answering a query using some compressed bitmap indexes in fact requires more time after the bitmaps have been read into memory, much more than the time to read the bitmaps. Therefore, the most direct way to improve the efficiency of a compressed bitmap is to make the compression method more compute-friendly.
The FastBit team invented such a compression method (patented in 2004) and demonstrated it to be 10 times faster than the best known bitmap compression method. In addition to being compute-efficient, it is also effective in reducing the index size. The size of a FastBit index is on average one-third the size of the original data, which is about one-tenth the size of the most widely used B-Tree index.
The R&D100 Award for the public release of FastBit is the latest in a series of honors from the research community. For example, FastBit enabled Grid-based analysis of high-energy physics data received an award from the 2005 International Supercomputer Conference and the FastBit developers’ work on network traffic analysis received an honorable mention in High Performance Analytics Challenge at the Supercomputing 2005 conference. FastBit publications have appeared in world class journals and conferences spanning both the database research field and visualization field, including ACM Transactions on Database Systems, Very Large Database conference and IEEE Visualization conference. FastBit has played crucial roles in two Ph.D. theses from two top universities, the University of California Berkeley and the University of Illinois Urbana-Champaign.
“In short, FastBit contains significant innovations that are well-recognized and have wide impacts in science, technology and education,” Wu said.