Title: Data Management in the DOE Genomics:GTL Program
1. Data Management in the DOE Genomics:GTL Program
Janet Jacobsen and Adam Arkin
Lawrence Berkeley National Laboratory
University of California, Berkeley
2. Topics (talk or handout)
- Basic facts about the Genomics:GTL Program
- Goals of the GTL Program
- Experimental data generated by GTL
- Laboratory methods
- Data management challenges, requirements, and needs
- Survey on Data Standards, Data Sharing, and Data Management (if time)
- Overall Recommendations
Lawrence Berkeley National Laboratory, University of California
3. Genomics:GTL Program
- Genomes to Life renamed Genomics:GTL
- One of three DOE genome programs
- First funding awards in July 2002
- Plan to fund and develop four user facilities
  - Production and Characterization of Proteins
  - Whole Proteome Analysis
  - Characterization and Imaging of Molecular Machines
  - Analysis and Modeling of Cellular Systems
4. Goals of the GTL Program
- Microbes are ubiquitous and have adapted to practically every environmental niche on earth. Some live and thrive in conditions generally thought to be inhospitable to life.
- GTL plans to study microbes and microbial communities that may be helpful in
  - energy generation,
  - environmental cleanup,
  - carbon sequestration.
5. Categories of Experimental Data
- Biomass production
- Genomic
  - sequence and annotate the microbe's genome
- Transcriptomic
  - study transcription under different conditions
- Proteomic
  - what proteins are present and at what levels
- Metabolomic
  - what metabolites are present
- and others
6. Laboratory Methods
- Biomass production
  - cell culture
- Transcriptomic (high-throughput, HTP)
  - microarrays
- Proteomic (HTP)
  - 2D gels, mass spectrometry
- Metabolomic (HTP)
  - mass spectrometry, NMR
7. Data Volume and Complexity
- Example: mass spectrometry
  - mass spec used to identify proteins
  - raw data analyzed to get peak list
  - peak list used to identify peptides
  - database search to identify proteins from peptides
- Volume
  - size of raw data set per experiment: 10 GB
  - multiple experiments per __/per organization
  - use FedEx to ship disk drives
- Complexity: see PEDRo UML class diagram on next slide
[Diagram: raw data -> peak list -> peptides -> proteins]
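The mass spectrometry chain on this slide (raw data, peak list, peptides, proteins) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the function names, the local-maximum peak picker, the mass tolerance, and the tiny lookup tables are hypothetical and far simpler than a real search engine or the PEDRo schema.

```python
from typing import Dict, List, Tuple

# A raw spectrum as (m/z, intensity) pairs sorted by m/z (toy values).
Spectrum = List[Tuple[float, float]]


def pick_peaks(spectrum: Spectrum, min_intensity: float) -> List[float]:
    """Raw data -> peak list: keep local intensity maxima above a threshold."""
    peaks = []
    for i in range(1, len(spectrum) - 1):
        mz, intensity = spectrum[i]
        if (intensity >= min_intensity
                and intensity > spectrum[i - 1][1]
                and intensity >= spectrum[i + 1][1]):
            peaks.append(mz)
    return peaks


def match_peptides(peaks: List[float],
                   peptide_mz: Dict[str, float],
                   tol: float) -> List[str]:
    """Peak list -> peptides: match peaks to reference m/z values within tol."""
    return sorted(pep for pep, mz in peptide_mz.items()
                  if any(abs(mz - peak) <= tol for peak in peaks))


def identify_proteins(peptides: List[str],
                      peptide_to_protein: Dict[str, str]) -> Dict[str, int]:
    """Peptides -> proteins: count matched peptides per protein."""
    counts: Dict[str, int] = {}
    for pep in peptides:
        protein = peptide_to_protein.get(pep)
        if protein is not None:
            counts[protein] = counts.get(protein, 0) + 1
    return counts


# Hypothetical toy inputs, not real measurements.
spectrum = [(500.0, 1.0), (500.3, 9.0), (500.6, 2.0),
            (650.1, 1.5), (650.4, 12.0), (650.7, 1.0),
            (800.0, 0.5), (800.2, 3.0), (800.4, 0.4)]
peptide_mz = {"PEPTIDEA": 500.31, "PEPTIDEB": 650.38, "PEPTIDEC": 800.21}
peptide_to_protein = {"PEPTIDEA": "protein_X",
                      "PEPTIDEB": "protein_X",
                      "PEPTIDEC": "protein_Y"}

peaks = pick_peaks(spectrum, min_intensity=5.0)
peptides = match_peptides(peaks, peptide_mz, tol=0.05)
proteins = identify_proteins(peptides, peptide_to_protein)
```

Each step discards most of its input, which is why the raw data dominate storage volume while the derived lists drive the schema complexity.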
8. [PEDRo UML class diagram; figure not reproduced]
9. Data Management Challenges
- INTEGRATING DATA FROM DIVERSE SOURCES IS THE KEY TO GTL'S SUCCESS
  - diverse: different laboratory methods, different organizations, different aspects of cellular functions/pathways
- CAPTURING METADATA IS VERY IMPORTANT
- In the future, we must be able to process LARGE numbers of LARGE data sets
  - Item 3 (large data volumes) is important, but not as important as items 1 and 2 (integration and metadata). We have to address those first.
10. Why Is Data Integration So Important to the GTL Program?
- Experimental data will be used to build models of cellular pathways, i.e., what goes on inside of the cell. Different types of data contribute to building different aspects of the model (response to environmental conditions, growth phases, etc.). Think of building a pathway as an inverse problem.
- In addition, experimental data are used to verify models.
11. Why Are Metadata So Important to the GTL Program?
- We need to capture not only sample treatment
(e.g., heat shock, oxygen stress), but all of the
conditions under which an experimental analysis
was performed. Otherwise we cannot compare the
results from different experiments. We want to
investigate how the same organism responds to
different conditions, and how different organisms
respond to the same condition. We also want to
capture uncertainty.
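One way to picture the metadata requirement is a record that carries the treatment, the analysis conditions, and an uncertainty alongside each result. The sketch below is only an illustration: the class name, field names, and example values are hypothetical and are not a GTL standard.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ExperimentRecord:
    """A result together with the metadata needed to compare it (hypothetical)."""
    organism: str
    treatment: str                     # e.g. heat shock, oxygen stress
    method: str                        # e.g. microarray, mass spectrometry
    conditions: Dict[str, str] = field(default_factory=dict)
    value: float = 0.0
    uncertainty: Optional[float] = None  # None means "not recorded"


def differing_conditions(a: ExperimentRecord, b: ExperimentRecord) -> List[str]:
    """List condition keys that differ or are missing in one of the records."""
    keys = set(a.conditions) | set(b.conditions)
    return sorted(k for k in keys if a.conditions.get(k) != b.conditions.get(k))


# Hypothetical records for the same organism under two treatments.
heat = ExperimentRecord("organism_A", "heat shock", "microarray",
                        {"temperature_C": "42", "medium": "M1"}, 2.1, 0.3)
oxygen = ExperimentRecord("organism_A", "oxygen stress", "microarray",
                          {"temperature_C": "30", "medium": "M1"}, 1.4, 0.2)
incomplete = ExperimentRecord("organism_A", "heat shock", "microarray",
                              {"temperature_C": "42"}, 1.9)

diff_known = differing_conditions(heat, oxygen)
diff_missing = differing_conditions(heat, incomplete)
```

A missing condition shows up as a difference, which mirrors the point of the slide: without the full set of conditions, two results cannot be compared with confidence.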
12. Other Data Management Needs
- All of the usual ones
  - secure access
  - storage of large volumes of data
  - data archives
  - data provenance
- plus one wrinkle: staging of data access and management.
13. Staging of Data Access/Management
- Stage 1: data are collected and undergo QA/QC within the lab producing the data; manage data locally.
- Stage 2: data are shared with other project collaborators; transport data and/or provide restricted access.
- Stage 3: data are published and move into the public domain; provide community-wide access to data.
- Stage 4: data are archived; need to provide safe storage from which the data can be retrieved.
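The four stages can be read as a simple one-way lifecycle with widening access. The sketch below is a minimal illustration under stated assumptions: the role names are hypothetical, and the rule that archived data remain readable by everyone who could read the published data is an assumption, not something the slides specify.

```python
from enum import IntEnum


class Stage(IntEnum):
    LOCAL = 1       # collected and QA/QC'd within the producing lab
    SHARED = 2      # shared with project collaborators
    PUBLISHED = 3   # public domain, community-wide access
    ARCHIVED = 4    # safe long-term storage


# Hypothetical roles allowed to read data at each stage.
READERS = {
    Stage.LOCAL: {"lab_member"},
    Stage.SHARED: {"lab_member", "collaborator"},
    Stage.PUBLISHED: {"lab_member", "collaborator", "public"},
    Stage.ARCHIVED: {"lab_member", "collaborator", "public"},  # assumption
}


def can_read(stage: Stage, role: str) -> bool:
    """Return True if the given role may read data at this stage."""
    return role in READERS[stage]


def advance(stage: Stage) -> Stage:
    """Move a data set to the next stage; archived data cannot advance."""
    if stage is Stage.ARCHIVED:
        raise ValueError("archived data are already at the final stage")
    return Stage(stage + 1)


stage = Stage.LOCAL
stage = advance(stage)  # now Stage.SHARED
```

Modeling the stage explicitly lets one storage system apply different access rules to the same data set over time.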
14. Survey on Data Standards, Data Sharing, and Data Management
- Follow up to work by the GTL Data Standards Working Group
- Link to survey mailed to registrants for GTL Program Workshop
- 50 respondents, mostly experimental biologists: 26 from national labs, 16 from universities, 8 from other organizations
- See handout for summary of survey results
15. Survey Results
- Most common data format (78%): spreadsheet
- Most common measurement type (70%): image
- Few respondents are using any data standard.
- FCS (Flow Cytometry Standard), which is a file format, is the only data standard that received a high rating.
- About 20 of the respondents expressed a willingness to participate in developing or implementing data standards for GTL.
16. Recommendations from the Survey
- Checklist of required information about experiments, experimental conditions, and data
- Data standards, data formats, file formats
- Software tools/Web interfaces for
  - data entry, including metadata and experiment details
  - data uploading, query, and access
- Data organization to relate information on sample origin to experimental data on the sample
- DBMS with software to enter data
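The recommendation to relate sample origin to experimental data can be sketched with two linked tables. This is a minimal illustration using an in-memory SQLite database; the table names, column names, and example rows are hypothetical and do not describe any schema adopted by GTL.

```python
import sqlite3

# In-memory database; a real deployment would use a shared DBMS.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sample (
    sample_id INTEGER PRIMARY KEY,
    organism  TEXT NOT NULL,
    origin    TEXT NOT NULL          -- where the sample came from
);
CREATE TABLE measurement (
    measurement_id INTEGER PRIMARY KEY,
    sample_id INTEGER NOT NULL REFERENCES sample(sample_id),
    method    TEXT NOT NULL,         -- e.g. microarray, mass spectrometry
    data_file TEXT NOT NULL          -- pointer to the instrument output
);
""")

# Hypothetical example rows.
conn.execute("INSERT INTO sample VALUES (1, 'organism_A', 'culture batch 1')")
conn.executemany(
    "INSERT INTO measurement (sample_id, method, data_file) VALUES (?, ?, ?)",
    [(1, "microarray", "array_001.txt"),
     (1, "mass spectrometry", "ms_001.raw")],
)

# Every measurement can be traced back to the origin of its sample.
rows = conn.execute("""
    SELECT s.origin, m.method, m.data_file
    FROM measurement AS m
    JOIN sample AS s ON s.sample_id = m.sample_id
    ORDER BY m.measurement_id
""").fetchall()
```

Keeping the sample in its own table means results from different laboratory methods on the same sample share one origin record.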
17. Comments from the Survey
- "It will help me a lot if someone will offer a short seminar on data standards."
- "Data standards are of more interest to computer scientists than to biological scientists."
- "This is all Greek to me, which is exactly why very little to nothing is being developed that is useful to biologists like me."
18. Difficulties in GTL Data Management
- Heterogeneous data. Metadata. Uncertainty.
- Lack of data standards. (Love/hate relationship.)
- Variety of DBMS being used.
- Variety of instrument output formats.
- Different DM phases with respect to data generation, analyses, and publication.
- Human factors: lab notebook -> electronic format (potential loss of information), data rearrangement in spreadsheets.
- Data attribution.
19. Overall Recommendations
- GTL Program
  - Establish data standards and facilitate implementation. Data standards MUST be compatible with formats required by journals.
  - Establish project-wide schema for organism/gene-based database(s) to facilitate integration.
  - Address data conversion problem.
- DOE: Require description of data management plan as part of proposal. (Currently being done?)
- Investigate digital notepad technology?
20. Acknowledgements
Carol Giometti, Argonne National Lab
Frank Olken, Lawrence Berkeley National Laboratory
Nancy Slater, GTL Project Manager, Lawrence Berkeley National Laboratory