In addition to the well-known strengths of bitmap indexes, FastBit has a special strength stemming from the bitmap compression scheme used. The compression method is called the Word-Aligned Hybrid (WAH) code. It reduces the bitmap indexes to reasonable sizes, and at the same time allows very efficient bitwise logical operations directly on the compressed bitmaps. Compared with the well-known compression methods such as LZ77 and Byte-aligned Bitmap code (BBC), WAH sacrifices some space efficiency for a significant improvement in operational efficiency [SSDBM 2002, DOLAP 2001]. Since the bitwise logical operations are the most important operations needed to answer queries, using WAH compression has been shown to answer queries significantly faster than using other compression schemes.
Theoretical analyses showed that WAH compressed bitmap indexes are optimal for one-dimensional range queries. Only the most efficient indexing schemes such as B+-tree and B*-tree have this optimality property. However, bitmap indexes are superior because they can efficiently answer multi-dimensional range queries by combining the answers to one-dimensional queries.
The select clause passed to function ibis::table::select can only contain column names separated by comma (,). Aggregate operations such as MIN, MAX, AVG, SUM, VARPOP, VARSAMP, STDPOP, STDSAMP, or DISTINCT are supported through another function named ibis::table::groupby. A group-by operation normally specified as one SQL statement needs to be split into two FastBit, one to select the values and the other to perform the aggregation operations. We've taken this approach to simplify the implementation. These aggregation operations are not directly supported by bitmap indexes, therefore, they are not essential to demonstrate the effectiveness of the bitmap indexes.
The where clause passed to function ibis::table::select can be a combination of range conditions connected with logical operators such as AND, OR, XOR, and NOT. Assuming that temperature and pressure are names of two columns, the following are valid where clauses (one on each line),
temperature > 10000 pressure between 10 and 100 temperature > 10000 and 50 <= pressure and sin(pressure/8000) < sqrt(abs(temperature))
The class ibis::table also defines a set of functions for computing histograms of various dimensions, namely, ibis::table::getHistogram, ibis::table::getHistogram2D, and ibis::table::getHistogram3D.
Using FastBit, one can only append new records to a table. These operations for extending a table is defined in the class ibis::tablex.
For most fixed-sized data, such as integers and floating-point values, FastBit functions expects raw binary data and also store them as raw binary, therefore the data files and index files are not portable across different platforms. This is common to both ibis::table interface and ibis::part interface. However, one difference is that ibis::table handles string values as std::vector<std::string>, while the lower level interface ibis::part handles strings as raw char* with null terminators.
The user query is represented as an ibis::query object. Each query is associated with one ibis::part object. The functions of the query class can be divided into three groups, (1) specifying a query, (2) evaluating a query, and (3) retrieving information about the hits. The queries accepted by FastBit are a subset of the SQL SELECT statement. Each query may have a WHERE clause and optionally a SELECT clause. Note that the FROM clause is implicit in the association with an ibis::part. The WHERE clause is a set of range conditions joined together with logical operators, e.g., A = 5 AND (B between 6.5 and 8.2 OR C > sqrt(5*D)). The SELECT clause can contain a list of column names and some of the four functions AVG, MIN, MAX, SUM, VARPOP, VARSAMP, STDPOP, STDSAMP and DISTINCT. Each of the functions can only take a column name as its argument. If a SELECT clause is omitted, it is assumed to be "SELECT count(*)." We refer to this type of queries as count queries since their primary purpose is to count the number of hits.
To evaluate a query, one calls either ibis::query::estimate or ibis::query::evaluate. After a query is evaluated, one may call various function to find the number of hits (ibis::query::getNumHits), the values of selected rows (ibis::query::getQualifiedInts, ibis::query::getQualifiedFloats, and ibis::query::getQualifiedDoubles), or the bitvector that represents the hits (ibis::query::getHitVector).
Currently, all indexes are in a single class hierarchy with ibis::index as the abstract base class. The most convenient way to create an index is calling the class function ibis::index::create. One can control what type of bitmap index to use by either specifying an index specification for a whole table by calling ibis::table::indexSpec, for a whole data partition by calling ibis::part::indexSpec, or for each individual column by calling ibis::column::indexSpec. The index specification along with other metadata are written to a file named -part.txt in the directory containing the base data and the index files. The directory name is needed when constructing an ibis::part. This information may be indirectly provided through an RC file specified to the function ibis::init.
This work was supported by the Director, Office of Science, Office of Advanced Scientific Computing Research, of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231 and DE-AC03-76SF00098.
|
| |