Background
Biological data management, which addresses the problems
of collection, storage, organization, management, retrieval,
and integration of rapidly expanding, evolving, and heterogeneous
biological data, is considered today one of the most critical
areas of modern data intensive biology research.[1] The main
focus of the past several years has been the development of
methods and technologies supporting high-throughput generation
of biological data, such as DNA sequence and gene expression
data. Compared to the rapid advances in the area of instrumentation,
biological data management is still relatively immature.
Commercial Biological Data Management
In pharmaceutical and biotech companies, biological data
management supports research and development in various stages
of drug development. Data from a variety of experiments, such
as gene expression experiments, need to be collected, interpreted,
validated, tracked, managed and integrated. Oftentimes native
experimental data need to be set in the context of extensive
annotations collected from diverse public and private biological
data repositories, and therefore further data collection,
validation and integration are required. Biological data management
activities in industry settings are carried out by
specialized bioinformatics groups, involve mainly commercial
off-the-shelf (COTS) data management systems and tools, and
are sensitive to data acquisition and tracking (data provenance)
requirements which are sometimes mandated.[2] Custom data
management systems are sometimes developed in order to address
application specific requirements that cannot be satisfied
using COTS systems or tools.[3]
Bioinformatics companies are mainly focused on developing
tools, such as LION's DiscoveryCenter,[4] for managing, integrating,
and/or analyzing biological data. Data from public biological
data sources are sometimes also assembled, re-packaged, and
provided as part of a data integration platform, such as LION's
SRS system.
Public Biological Data Management
Public biological data are usually generated in specialized
laboratories or centers, such as the Joint Genome Institute
(JGI), and then collected by data centers, such as those at
NCBI [5] and EBI,[6] where data may undergo some level of
annotation.
Public data centers sometimes prefer using free, rather than
commercial, data management systems, such as MySQL, or develop
native data management systems, such as the AceDB system used
for managing databases at the Sanger Institute.[7] Reasons
for this preference include the high cost of commercial systems
such as Oracle, compounded by the cost of the trained data
management professionals needed to use such systems effectively,
and the limited support in commercial systems for structures and
operations specific to biological data. Public
data centers operate under less stringent data provenance
requirements than their industry counterparts.
In addition to the large centers, such as NCBI and EBI, that
manage community repositories, there are also numerous smaller
scale centers, such as CBIL,[8] developing specialized biological
databases. Such centers work under more restricted funding,
and therefore follow less stringent maintenance policies.
Some of these centers also have an educational role (for example,
CBIL is part of University of Pennsylvania's bioinformatics
program) and engage in more forward looking R&D projects.
Biological Data Management Challenges
Biological data management involves the traditional areas
of data generation and acquisition, data modeling, data integration,
and data analysis. Technology platforms for generating biological
data present data management challenges arising from the need
to capture, organize, interpret and archive vast amounts of
experimental data. Platforms keep evolving with new versions
benefiting from technological improvements, such as higher
density arrays and better probe selection for microarrays.[9]
This evolution raises the additional problem of collecting
potentially incompatible data generated using different versions
of the same platform, a problem encountered when these data need
to be integrated and analyzed. Further challenges include
qualifying the data generated using inherently imprecise tools
and techniques and the high complexity of integrating data
residing in diverse and poorly correlated repositories.
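As an illustration of the platform versioning problem described above, the following minimal Python sketch groups expression measurements by platform and version so that values produced by different versions are not pooled by accident. The record fields (`platform`, `version`, `probe_id`, `value`), the function name, and the sample values are hypothetical and are not taken from any system discussed here.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    """One expression value; all field names are illustrative assumptions."""
    platform: str   # e.g. a microarray product line
    version: str    # platform version that produced the value
    probe_id: str
    value: float


def group_by_platform_version(measurements):
    """Return {(platform, version): [Measurement, ...]}.

    Keeping the groups separate makes any cross-version integration an
    explicit step rather than an accidental side effect of loading data.
    """
    groups = defaultdict(list)
    for m in measurements:
        groups[(m.platform, m.version)].append(m)
    return dict(groups)


# Made-up sample records, for illustration only.
data = [
    Measurement("arrayX", "v1", "p001", 7.2),
    Measurement("arrayX", "v2", "p001", 6.9),
    Measurement("arrayX", "v1", "p002", 3.4),
]
groups = group_by_platform_version(data)
```

Whether and how two version groups can be combined (for example, through a probe mapping between versions) is a domain decision that a sketch like this deliberately leaves open.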
A number of biological data management challenges have been
examined in the context of both traditional and scientific
database applications. When considering these challenges,
it is important to determine whether they require new or additional
research, or can be addressed by adapting and/or applying
existing data management tools and methods to the biological
domain. Successful commercial biological data management systems
and products[10] suggest that existing data management tools
and methods, such as commercial database management systems,
data warehousing tools, and statistical methods can be adapted
effectively to the biological domain. For example, the development
of Gene Logic's gene expression data management system has
involved modeling and analyzing microarray data in the context
of gene annotations (including sequence data from a variety
of sources), pathways, and sample (e.g., morphology, demography,
clinical) annotations, and has been carried out using or adapting
existing tools.[11] Dealing with data uncertainty or inconsistency
for experimental data has required statistical, rather than
traditional data management, methods; adapting statistical
methods to gene expression data analysis at various levels
of granularity has been the subject of intense research and
development in recent years.[12] The most difficult problems
have been encountered in the area of data semantics and data
slicing - the former regards properly qualifying data values
(e.g., an estimated expression value) and their relationships,
especially in the context of continuously changing platforms
and evolving biological knowledge, while the latter regards
identifying the logical units of data for analysis in order
to allow effective data mining. While such problems are encountered
across all biological data management areas, from data generation
through data collection and integration to data analysis,
the solutions require domain specific knowledge and extensive
data definition and curation work, with data management providing
the framework (e.g., controlled vocabularies, ontologies)
to address these problems.
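One way to picture the framework role of a controlled vocabulary is a check that flags annotation values outside the permitted terms, leaving the correction itself to curators. The sketch below assumes a simple mapping from field names to sets of allowed terms; the vocabularies, field names, and sample annotation are hypothetical.

```python
def check_annotation(annotation, vocabularies):
    """Return a list of (field, value) pairs not allowed by the vocabulary.

    `vocabularies` maps a field name to the set of permitted terms; fields
    without a vocabulary are not checked.
    """
    problems = []
    for field, value in annotation.items():
        allowed = vocabularies.get(field)
        if allowed is not None and value not in allowed:
            problems.append((field, value))
    return problems


# Hypothetical vocabularies and sample annotation, for illustration only.
VOCABULARIES = {
    "tissue": {"liver", "kidney", "lung"},
    "sex": {"female", "male", "unknown"},
}
sample = {"tissue": "Liver", "sex": "female", "donor_age": "54"}
problems = check_annotation(sample, VOCABULARIES)
```

Here the capitalized tissue term is reported because the comparison is exact; deciding whether such a value is a spelling variant or a genuinely new term is the curation work that the text describes.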
A different, but no less serious, challenge is posed by the
complexity of selecting methods and tools to develop a biological
data management system. Such a system may involve a mix of
commercial off-the-shelf (COTS) tools, open source, and custom
developed software. COTS tool vendors, such as Oracle,[13]
IBM,[14] and EMC,[15] have established Life Sciences divisions
or programs that are dedicated to showing how their tools address
key problems in a Life Science organization. However, the
complexity of COTS tools poses a substantial challenge when
devising a biological data management system. For example,
while relational DBMSs have been used extensively for developing
both commercial and public biological data management systems,
employing a DBMS effectively is a demanding and complex task.
Furthermore, COTS based approaches could lead to overly expensive
and not necessarily optimal solutions to a specific problem.
Conversely, open source tools and software, such as MySQL,
do not carry any up front costs, but are sometimes more limited
than COTS tools.
Solutions to biological data management challenges need to
be considered in terms of complexity, cost, robustness, performance,
user and application specific requirements, as well as in
the context of well defined timeframes; depending on context,
partial but rapidly developed solutions may be more valuable
than complete but time consuming solutions. Systems that are
appropriate in a given context may be inadequate in a different
context - for example, a system that is appropriate for
a small exploratory project confined to a small
group is likely to be inadequate in a data intensive environment
with numerous users, where reliability, robustness, comprehensibility,
and performance are critical.
Addressing data management challenges effectively requires
expertise in several areas, such as data modeling, database
administration, data sharing and security, software engineering,
software and data management quality control, statistics,
and data management infrastructure. Few organizations, especially
in academia, can afford setting up data management groups
because of the high complexity and cost involved. This problem
can be addressed by pooling resources for a Data
Management and Technology Center that can serve multiple organizations.
Biological Data Management and Technology Center
Rationale
The need for biological data, biocomputing, and software
centers is discussed in DOE's Genomes to Life (GTL)[16] program
and NIH's Roadmap for Accelerating Medical Discovery to Improve
Health.[17] GTL envisions four different types of facilities
generating data that would be organized in a variety of databases,
including expression, proteomic, protein-function, chemistry,
and pathway databases. Data generation in these facilities
will be controlled using workflow management and/or Laboratory
Information Management Systems (LIMS). Data will be collected,
archived, and passed through a number of processing stages,
including data annotation and integration. GTL also envisions
computing infrastructure facilities in the form of software,
biocomputing and data centers. In particular, a "seamless
and effectively centralized capability to deal with data"
in the form of data centers that effectively collect and integrate
large scale biological data is seen as key to GTL's success.
Requirements for computing infrastructure have been discussed
in a series of GTL workshops.[18] These workshops have identified
a number of data management issues that are deemed important
for GTL's success and that may require further research, but
the workshops have not addressed the question of how the basic
facilities and the various computing infrastructure facilities
would interact. The specific goals and functions of a biological
data center have also not been discussed at these workshops.
Structure and Functions
A data center as envisioned by the GTL initiative needs to
address key data management challenges including the massive
and ongoing increase in the amount and range of biological
data, the difficulty of quantifying the quality of data generated
using inherently imprecise tools and techniques, and the high
complexity of integrating data residing in diverse and sometimes
poorly correlated repositories. Addressing these challenges
requires a strategy for devising effective solutions that
both respond to the immediate requirement of supporting ongoing
data generation and pursue longer term goals.
A Biological Data Management and Technology Center should
be based on proven strengths of both commercial and public
centers. Setting up such a center employing industry practices
in funding and organization ensures maintaining a focused
effort in conjunction with the development of "industrial
strength" databases and data management tools. A biological
data center also needs high academic standards: the discipline
and rigor that are required for the development of scientifically
sound methods and techniques for generating and interpreting
biological data.[19]
Rigorous data management practices and sound expertise are
needed for addressing large scale biological data generation,
collection and validation, which often involve complex data
acquisition, tracking and control systems. Such problems mainly
require deploying or adapting existing tools and platforms,
such as Laboratory Information Management Systems, Database
Management Systems, and Data Warehouse tools. Accordingly,
a biological data center needs qualified database management
and administration professionals, software engineers for adapting
and/or integrating tools, and (bio) statisticians for handling
platform specific data interpretation and validation. An important
task for a biological data management center is to provide
efficient, reliable and secure access to its data to a large
community of scientists as well as other centers. This task
can be addressed by using or adapting existing hardware or
software technologies for data mirroring and access.
A Biological Data Management and Technology Center also needs
to pursue long term goals with regard to critical data management
problems that cannot be resolved using existing technology.
Since data management technology is evolving, the center must
be involved in a continuous and detailed technology assessment,
including benchmarking [20] and cost assessment of potential
solutions.[21] Cost effectiveness and the ability to take advantage
of rapid technological advances without sacrificing quality, time,
or cost should be built into data management solutions that
are inherently evolving.
Research needs to be conducted in order to address critical
problems that are not addressed by existing technology. Collaboration
with commercial companies, such as Oracle, Sun, and IBM, may defray
the costs of such activities.
Goals
The main goal of the Biological Data Management and Technology
Center will be to serve as a source of expertise in and provide
support for data management activities at the Joint Genome
Institute, Life Sciences and Physical Biosciences Divisions
at LBNL, UCSF's Cancer Center, and other Biomedical and Biotechnology
Centers in the Bay Area. The Center will provide services
based on collaborations with these organizations. Collaborations
with the Center will be cost effective by allowing multiple
organizations to share the experience, skills, and data management
technology at the Center.
Initially, the Center will focus on providing support to
the Joint Genome Institute (JGI), where a number of areas that
will benefit from the Center's services have been identified.
JGI provides key sources of data to be managed as well
as the initial biological programmatic context for the Center.
Several JGI data management areas that could be improved are
briefly discussed below. Once the Center is established, additional
areas in which the Center can provide support for JGI will
be identified after further review of JGI's planned activities.
- Sequence Data Organization and Retrieval. The Production
Genomic Facility (PGF) produces about 2 million files per
month of trace data, 100 assembled projects per month, and
several very large assembled projects per year. PGF is currently
increasing its sequencing capability, which increases the challenge
of making data available online; online access to
trace files may be required for quality control and functional
genomics purposes. The Center will provide PGF with a solution
to this problem. Specific tasks will include revising existing
procedures for capturing information (metadata) about sequence
data files and grouping these files in order to improve
their organization at all levels of granularity, and developing
mechanisms for automatic organization of these files as
well as for their effective retrieval and processing.
- JGI Portal. JGI makes available its completed sequences
to the scientific community through a web portal. The Center
will assist JGI in enhancing its portal. Specific tasks
will include reviewing the organization of the current
portal, working jointly with JGI's portal group to enhance
its functionality (for example, through a closer integration
of sequence data with genome annotations, such as functional
and pathway information, in other public genome data resources),
and extending the portal's search and query capabilities.
Another area of improvement for the portal that will be
considered is on-line management of sponsored sequencing
projects, whereby sponsors would be able to follow the progress
of their sequencing projects and gain access to data without
delay.
- Microbial Sequencing. JGI is ramping up its microbial
sequencing efforts and is starting work in the new area of "community
sequencing", which involves novel microbial genomes
from environmental samples found in a diverse range of habitats.
This new type of sequencing requires a change in the way
sequence data are modeled, including support for acquiring
and storing contextual (e.g., environmental) properties
that are essential in characterizing the sequence data.
Although completed sequences will continue to be deposited
in GenBank to allow public access to these data, this may
not be sufficient for holding all the information associated
with community sequencing data. The Center will address
this problem by devising a new sequence data resource that
would complement GenBank and will include data that do
not fit GenBank. Specific tasks will include gathering requirements
specific to the community sequencing activities, designing
and developing a data management system for acquiring
both sequence and contextual data, and developing a
data resource that stores these data and is available to the scientific
community.
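Two of the tasks listed above, capturing metadata that groups trace files and modeling sequences together with their contextual properties, can be pictured with the small Python sketch below. It is only a sketch under stated assumptions: the record types, the field names (`project`, `library`, `plate`, `context`), and the sample values are hypothetical and do not describe JGI's or PGF's actual procedures or data.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TraceFile:
    """Metadata captured for one trace file (hypothetical fields)."""
    path: str
    project: str
    library: str
    plate: str


def index_traces(traces):
    """Group trace file paths by project, then library, then plate."""
    index = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for t in traces:
        index[t.project][t.library][t.plate].append(t.path)
    return index


@dataclass
class CommunitySequence:
    """A sequence plus the contextual properties of its sample."""
    sequence_id: str
    residues: str
    # Free-form contextual (e.g., environmental) properties.
    context: dict = field(default_factory=dict)


# Made-up sample records, for illustration only.
traces = [
    TraceFile("t/0001.trace", "projA", "lib1", "plate01"),
    TraceFile("t/0002.trace", "projA", "lib1", "plate01"),
    TraceFile("t/0003.trace", "projA", "lib2", "plate07"),
]
index = index_traces(traces)
lib1_plate01 = index["projA"]["lib1"]["plate01"]

seq = CommunitySequence("seq-1", "ACGT", {"habitat": "soil", "depth_m": 0.3})
```

The nested index allows retrieval at several levels of granularity (a whole project, one library, or one plate). At the file volumes quoted above, such metadata would more plausibly be kept in a database than in memory; the in-memory form is used here only to keep the example short.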
Longer term, the Center will establish collaborative relationships
with and provide support for biological research programs
at LBNL's Life Sciences and Physical Biosciences Divisions,
the Biotechnology Programs at UCB and UCSF, and will be involved
in future National Centers set up in the Bay Area. Dr. Joe
Gray, head of LBNL's Life Sciences Division, has expressed
strong support for establishing a Biological Data Management
and Technology Center and pledged to involve it in future
proposals that have a data management component.
Lawrence Berkeley National Lab (LBNL) provides an ideal location
for a biological data center with its premier multidisciplinary
research environment. In particular, NERSC, and ongoing image
analysis, visualization and scientific data management research
in the Computational Research Division can complement the
data interpretation, visualization, and analysis efforts in
a Biological Data Management and Technology Center.
From an educational point of view, the center will provide
an ideal environment for students to gain practical experience
in large scale biological data management and analysis, and
can draw upon and complement programs in the Computer Science,
Statistics, and Bioengineering Departments at UC Berkeley.
1. "Bioinformatics: Getting Results in the Era of High-Throughput
Genomics", Branca, M.A., Goodman, N., and Venkatesh,
T.V., Cambridge Healthtech Institute Report 9, May 2001.
2. FDA, Guidance for Industry, Part 11, Electronic Records;
Electronic Signatures Scope and Application, http://www.fda.gov/cder/guidance/5667fnl.htm.
3. An example of such a system is described in "Process
Biology: Managing Information Flow for Improved Decision Making
in Preclinical R&D", Reidhaar-Olson, J.F, Ohkawa,
H., Babiss, L.E., and Hammer, J., Preclinica, Vol. 1, No.
4, 2003.
4. LION DiscoveryCenter, http://www.lionbioscience.com/solutions/discoverycenter.
5. National Center for Biotechnology Information, http://www.ncbi.nlm.nih.gov/.
6. European Bioinformatics Institute Databases, http://www.ebi.ac.uk/Databases/.
7. AceDB, Sanger Institute, http://www.acedb.org/.
8. Computational Biology and Informatics Laboratory in the
Center for Bioinformatics at the University of Pennsylvania,
http://www.cbil.upenn.edu/.
9. "DNA Microarray Informatics: Key Technological Trends
and Commercial Opportunities", Branca M.A. and Goodman,
N., Cambridge Healthtech Institute Report 19, February 2002.
10. For example, see gene expression data products such as
Gene Logic's Genesis Enterprise System, http://www.genelogic.com/solutions/genesis/,
Silicon Genetics's GeneSpring System, http://www.silicongenetics.com/cgi/SiG.cgi/Products/GeneSpring/index.smf,
and Rosetta's Resolver System, http://www.rosettabio.com/products/resolver/default.htm.
11. Markowitz, V.M., Campbell, J., Chen, I.A., Kosky, A.,
Palaniappan, K., and Topaloglou, T., "Integration Challenges
in Gene Expression Data Management." Chapter in Bioinformatics:
Managing Scientific Data, Morgan Kaufmann Publishers (Elsevier
Science), 2003, pp. 277-301.
12. See, for example, http://oz.berkeley.edu/users/terry/zarray/Html/index.html.
13. Oracle, Solutions for Life Sciences, http://www.oracle.com/industries/life_sciences/index.html?content.html.
14. IBM Life Sciences, http://www-3.ibm.com/solutions/lifesciences/.
15. EMC, Life Sciences Infrastructure Solutions, http://www.emc.com/vertical/pdfs/life_sciences/interstitial_data_warehouse.jsp.
16. "User facilities for 21st Century Systems Biology:
Providing Critical Technologies for the Research Community",
Department of Energy, Office of Biological and Environmental
Research, November 2002, http://www.doegenomestolife.org/pubs.html.
17. NIH Roadmap: Bioinformatics and Computational Biology,
http://nihroadmap.nih.gov/bioinformatics/index.asp.
18. Mathematics for GTL Workshop, Gaithersburg, Maryland;
March 18-19, 2002, http://www.doegenomestolife.org/pubs/GTLMath-6.pdf.
Computer Science for GTL Workshop, Gaithersburg, Maryland;
March 6-7, 2002, http://www.doegenomestolife.org/compbio/mtg_1_22_02/infrastructure.pdf.
19. For example, gene expression data interpretation methods
have been improved in recent years mainly due to active academic
research; see, for example, A Benchmark for Affymetrix
GeneChip Expression Measures, http://affycomp.biostat.jhsph.edu/.
20. Benchmarking is needed in order to gain a good understanding
of existing technologies, beyond the hype usually surrounding
them.
21. Industry (so called P&L) cost assessment is a good
way of determining both the short and long term advantage
of developing in-house solutions compared to acquiring
off-the-shelf solutions.