Enron Email Database Proves Easy Pickings for LBNL’s FastBit Search Technology
February 10, 2006
As the trial of former Enron executives gets under way, the extensive email trails left by employees of the Houston energy firm are expected to provide both compelling evidence and entertaining insight.
In 2003, as part of an investigation into Enron’s business dealings in California, the Federal Energy Regulatory Commission made public a database containing more than 500,000 emails sent by 151 Enron employees. Subjects ranged from corporate decisions to jokes to personal matters. While the subject matter makes for intriguing reading, the entire database also proved an interesting subject for a number of researchers around the country, including members of the Scientific Data Management Research Group at Lawrence Berkeley National Laboratory.
According to Carnegie Mellon University’s William W. Cohen, who posted the dataset on the Web, the Enron email dataset is proving to be “a resource for researchers who are interested in improving current email tools, or understanding how email is currently used. This data is valuable; to my knowledge it is the only substantial collection of ‘real’ email that is public.” As a result, researchers at such institutions as MIT, UC Berkeley, University of Massachusetts, University of Southern California and SRI have used the data to study social networks as evidenced by the exchange of email messages.
The Berkeley Lab group decided to conduct a series of searches of the Enron email dataset to see how FastBit, an efficient, compressed bitmap indexing technology that was developed by the group, stacked up against the MySQL database, which bills itself as “the world’s most popular open source database,” in which the data were stored.
In a report published in January 2006, the group evaluated the performance of MySQL and FastBit in handling a number of queries for a dataset of 250,000 unique email messages sent by 151 Enron employees and found that FastBit outperformed MySQL — between 10 to 1,000 times faster, depending on the size of the search result. To achieve their results, group members conducted several experiments.
“In our first set of experiments we measured the performance of searching for specific senders and receivers of the emails,” wrote Kurt Stockinger, lead author of the group’s report. “We built an index for each of these two attributes. Since both senders and receivers are in different database tables, this kind of search requires an expensive ‘join operation.’ “
To reduce the processing time needed for such join operations, the group combined the two separate tables to create a new table which contained names of all the senders and receivers. Called the materialized table, this newly created table contains 2 million records. Since the number of original messages was 250,000, this indicates that, on average, each message contains 8 recipients.
“We also built indices for sender and receiver on the materialized table,” Stockinger wrote. “In order to build bitmap indices for the materialized table, we needed to export the data into binary files. In particular, we stored each attribute in a separate file and then built a bitmap index for the attributes sender and receiver.”
First, the group measured the performance of queries of the form “Retrieve the recipients of all emails that were sent by person P.” For these experiments, a group of 100 names were randomly selected and a query executed for each person. A total of 100 queries were run and the group measured the retrieval time, including the time to extract the result after the search. The results showed that using the MySQL-Join approach took the most time, about 8 seconds, while MySQL-Materialized (which used the expanded table) took from 0.01 to 0.9 seconds to complete the queries, depending on the number of hits. FastBit took about 0.0075 seconds, regardless of the number of hits, making it up to 100 times faster than the best MySQL result.
When the group measured the performance of queries of the form “Retrieve all senders of emails that were received by person P,” FastBit was again up to a factor of 100 faster than MySQL-Materialized.
In the next experiments the researchers measured the query performance of a larger dataset by duplicating the Enron dataset 10 times. The resulting materialized table contains some 20 million records. In experiments with one specific search criterion, FastBit was again up to a factor of 100 faster than MySQL.
In the group’s last set of experiments, they measured the performance of queries with multiple search criteria (multidimensional queries), such as “Count the number of emails that were sent by person P in the time interval T before date D”. The results showed that as the number of query dimensions increases, the relative performance improvement of FastBit over MySQL increases even more. For these types of queries, FastBit is up to 1000 times faster than MySQL.
FastBit is the software which incorporates the Word-Aligned Hybrid (WAH) compression method developed and patented by John Wu, Arie Shoshani and Ekow Otoo of Berkeley Lab’s Scientific Data Management Research Group. Originally developed to search data from Department of Energy high energy physics experiments, WAH compresses bitmap indexes, a method of reducing the response time of queries involving common types of conditions in data objects, such as "state = CA" and "age >= 21." It achieves this by storing certain pre-computed answers as bitmaps. For example, a bitmap index for "state" might have one bitmap for each state in the U.S.
Because computers can manipulate bitmaps efficiently, bitmap indices are efficient in searching for interesting records in large datasets. WAH compression makes the bitmap index optimal in terms of computational complexity. A small number of the most efficient indexing schemes have this optimality property. What makes the new technology unique is that WAH-compressed indexes significantly outperform other schemes in tests.
The Scientific Data Management Research Group is part of Berkeley Lab’s Computational Research Division (http://crd.lbl.gov/), which creates computational tools and techniques that enable scientific breakthroughs, by conducting applied research and development in computer science, computational science, and applied mathematics.