Data Management in the DOE Genomics:GTL Program

1
Data Management in the DOE Genomics:GTL Program
Janet Jacobsen and Adam Arkin
Lawrence Berkeley National Laboratory
University of California, Berkeley
2
Topics (talk or handout)
  • Basic facts about the Genomics:GTL Program
  • Goals of the GTL Program
  • Experimental data generated by GTL
  • Laboratory methods
  • Data management challenges, requirements, and
    needs
  • Survey on Data Standards, Data Sharing, and Data
    Management (if time)
  • Overall Recommendations

Lawrence Berkeley National Laboratory, University of California
3
Genomics:GTL Program
  • Genomes to Life, renamed Genomics:GTL
  • One of three DOE genome programs
  • First funding awards in July 2002
  • Plan to fund and develop four user facilities:
      • Production and Characterization of Proteins
      • Whole Proteome Analysis
      • Characterization and Imaging of Molecular
        Machines
      • Analysis and Modeling of Cellular Systems

4
Goals of the GTL Program
  • Microbes are ubiquitous and have adapted to
    practically every environmental niche on earth.
    Some live and thrive in conditions generally
    thought to be inhospitable to life.
  • GTL plans to study microbes and microbial
    communities that may be helpful in
      • energy generation,
      • environmental cleanup,
      • carbon sequestration.

5
Categories of Experimental Data
  • Biomass production
  • Genomic
      • sequence and annotate the microbe's genome
  • Transcriptomic
      • study transcription under different conditions
  • Proteomic
      • what proteins are present and at what levels
  • Metabolomic
      • what metabolites are present
  • and others

6
Laboratory Methods
  • Biomass production
      • cell culture
  • Transcriptomic (high-throughput, HTP)
      • microarrays
  • Proteomic (HTP)
      • 2D gels, mass spectrometry
  • Metabolomic (HTP)
      • mass spectrometry, NMR

7
Data Volume and Complexity
  • Example: mass spectrometry
      • mass spec used to identify proteins
      • raw data analyzed to get peak list
      • peak list used to identify peptides
      • database search to identify proteins from
        peptides
  • Volume
      • size of raw data set per experiment: 10 GB
      • multiple experiments per __/per organization
      • use FedEx to ship disk drives
  • Complexity: see PEDRo UML class diagram on next
    slide

[Diagram: raw data -> peak list -> peptides -> proteins]
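
The mass spectrometry chain on this slide (raw data, peak list, peptides, proteins) can be sketched in a few lines of Python. This is a toy illustration only: the spectrum values, peptide masses, protein names, and function names are all hypothetical, and real peak picking and database searching are far more involved.

```python
# Toy sketch of: raw data -> peak list -> peptides -> proteins.
# All values and names below are hypothetical, for illustration only.

def pick_peaks(spectrum, min_intensity):
    """Return m/z values that are local intensity maxima above a threshold."""
    peaks = []
    for i in range(1, len(spectrum) - 1):
        mz, intensity = spectrum[i]
        if (intensity >= min_intensity
                and intensity > spectrum[i - 1][1]
                and intensity > spectrum[i + 1][1]):
            peaks.append(mz)
    return peaks

def match_peptides(peaks, peptide_masses, tol):
    """Return peptides whose mass lies within tol of some peak."""
    return sorted(
        pep for pep, mass in peptide_masses.items()
        if any(abs(mass - p) <= tol for p in peaks)
    )

def identify_proteins(peptides, peptide_to_protein):
    """Map matched peptides back to the proteins that contain them."""
    return sorted({peptide_to_protein[p] for p in peptides})

# Hypothetical raw data: (m/z, intensity) pairs.
raw = [(500.0, 1), (500.3, 9), (500.6, 2),
       (742.1, 3), (742.4, 12), (742.7, 4), (900.0, 1)]
# Hypothetical search database.
peptide_masses = {"PEPA": 500.3, "PEPB": 742.4, "PEPC": 610.2}
peptide_to_protein = {"PEPA": "protein_X", "PEPB": "protein_X", "PEPC": "protein_Y"}

peak_list = pick_peaks(raw, min_intensity=5)
peptides = match_peptides(peak_list, peptide_masses, tol=0.05)
proteins = identify_proteins(peptides, peptide_to_protein)
```

Each step discards most of its input, which is why the raw data dominate the storage volume while the peak list and identifications stay small.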
8
[Figure: PEDRo UML class diagram]
9
Data Management Challenges
  1. INTEGRATING DATA FROM DIVERSE SOURCES IS THE KEY
     TO GTL'S SUCCESS
      • diverse: different laboratory methods,
        different organizations, different aspects of
        cellular functions/pathways
  2. CAPTURING METADATA IS VERY IMPORTANT
  3. In the future, we must be able to process LARGE
     numbers of LARGE data sets
  • Item 3 is important, but not as important as
    items 1 and 2. We have to address those first.

10
Why is Data Integration So Important to the GTL
Program?
  • Experimental data will be used to build models of
    cellular pathways, i.e., what goes on inside of
    the cell. Different types of data contribute to
    building different aspects of the model (response
    to environmental conditions, growth phases,
    etc.). Think of building a pathway as an inverse
    problem.
  • In addition, experimental data are used to verify
    models.

11
Why are Metadata So Important to the GTL
Program?
  • We need to capture not only sample treatment
    (e.g., heat shock, oxygen stress), but all of the
    conditions under which an experimental analysis
    was performed. Otherwise we cannot compare the
    results from different experiments. We want to
    investigate how the same organism responds to
    different conditions, and how different organisms
    respond to the same condition. We also want to
    capture uncertainty.
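
As a minimal sketch of why the conditions matter, the record below stores a few conditions per experiment and checks whether two experiments differ only in the one factor under study. The field names, the example values, and the rule that uncertainty is excluded from the comparison are assumptions made for illustration, not a GTL schema.

```python
from dataclasses import dataclass, fields

@dataclass(frozen=True)
class ExperimentMetadata:
    # Hypothetical fields; a real checklist would be much longer.
    organism: str
    treatment: str      # e.g. "heat shock", "oxygen stress"
    method: str         # e.g. "microarray", "mass spectrometry"
    growth_phase: str
    medium: str
    uncertainty: float  # relative measurement uncertainty (assumed meaning)

def differing_conditions(a, b):
    """Names of condition fields (everything except uncertainty) that differ."""
    return [f.name for f in fields(a)
            if f.name != "uncertainty" and getattr(a, f.name) != getattr(b, f.name)]

def comparable(a, b, vary):
    """True if the two experiments differ in no condition other than `vary`."""
    return set(differing_conditions(a, b)) <= {vary}

e1 = ExperimentMetadata("organism_A", "heat shock", "microarray", "exponential", "medium_1", 0.10)
e2 = ExperimentMetadata("organism_A", "oxygen stress", "microarray", "exponential", "medium_1", 0.12)
e3 = ExperimentMetadata("organism_B", "heat shock", "microarray", "stationary", "medium_1", 0.10)
e4 = ExperimentMetadata("organism_B", "heat shock", "microarray", "exponential", "medium_1", 0.15)

same_organism_different_condition = comparable(e1, e2, vary="treatment")
different_organism_same_condition = comparable(e1, e4, vary="organism")
confounded = comparable(e1, e3, vary="organism")  # growth phase also differs
```

Without the recorded growth phase, the e1/e3 pair would look like a clean organism comparison even though a second condition changed.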

12
Other Data Management Needs
  • All of the usual ones:
      • secure access
      • storage of large volumes of data
      • data archives
      • data provenance
  • plus one wrinkle: staging of data access
    and management.

13
Staging of Data Access/Management
  • Stage 1: data are collected and QA/QC'd within the lab
    producing the data; manage data locally.
  • Stage 2: data are shared with other project
    collaborators; transport data and/or provide
    restricted access.
  • Stage 3: data are published and move into the
    public domain; provide community-wide access to
    data.
  • Stage 4: data are archived; need to provide safe
    storage from which the data can be retrieved.
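
The four stages can be read as a small state machine with an access rule per stage. The sketch below is one possible encoding, not a design from the talk: the role names ("lab", "collaborator", "public") and the choice that archived data stay readable by everyone are assumptions.

```python
from enum import IntEnum

class Stage(IntEnum):
    LOCAL = 1      # collected and QA/QC'd within the producing lab
    SHARED = 2     # shared with project collaborators
    PUBLISHED = 3  # public domain, community-wide access
    ARCHIVED = 4   # safe long-term storage

# Hypothetical roles and who may read at each stage.
READERS = {
    Stage.LOCAL: {"lab"},
    Stage.SHARED: {"lab", "collaborator"},
    Stage.PUBLISHED: {"lab", "collaborator", "public"},
    Stage.ARCHIVED: {"lab", "collaborator", "public"},  # assumption
}

def can_read(stage, role):
    """True if the given role may read data at this stage."""
    return role in READERS[stage]

def advance(stage):
    """Move a data set to the next stage; archived is the final stage."""
    if stage is Stage.ARCHIVED:
        raise ValueError("archived data has no further stage")
    return Stage(stage + 1)

stage = Stage.LOCAL
collaborator_before = can_read(stage, "collaborator")
stage = advance(stage)
collaborator_after = can_read(stage, "collaborator")
```

Keeping the stage as explicit data on each data set lets the same storage system enforce different access rules as the data move from the lab toward publication.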

14
Survey on Data Standards, Data Sharing, and Data
Management
  • Follow up to work by the GTL Data Standards
    Working Group
  • Link to survey mailed to registrants for GTL
    Program Workshop
  • 50 respondents, mostly experimental biologists:
    26 from national labs, 16 from universities, 8
    from other organizations
  • See handout for summary of survey results

15
Survey Results
  • Most common data format (78%): spreadsheet
  • Most common measurement type (70%): image
  • Few respondents are using any data standard.
  • FCS (Flow Cytometry Standard), which is a file
    format, is the only data standard that received a
    high rating.
  • About 20 of the respondents expressed a
    willingness to participate in developing or
    implementing data standards for GTL.

16
Recommendations from the Survey
  • Checklist of required information about
    experiments, experimental conditions, and data
  • Data standards, data formats, file formats
  • Software tools/Web interfaces for
      • data entry, including metadata and experiment
        details
      • data uploading, query, and access
  • Data organization to relate information on sample
    origin to experimental data on the sample
  • DBMS with software to enter data
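
One way to picture the recommendation on relating sample origin to experimental data is a two-table layout in which every measurement points back to its sample. The sketch below uses Python's built-in sqlite3 module; the table names, column names, and example rows are hypothetical, and the NOT NULL constraints are only a crude stand-in for a checklist of required information.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sample (
    sample_id INTEGER PRIMARY KEY,
    organism  TEXT NOT NULL,
    origin    TEXT NOT NULL,
    treatment TEXT NOT NULL
);
CREATE TABLE measurement (
    measurement_id INTEGER PRIMARY KEY,
    sample_id      INTEGER NOT NULL REFERENCES sample(sample_id),
    method         TEXT NOT NULL,
    data_file      TEXT NOT NULL
);
""")

# Hypothetical sample and two measurements made on it.
conn.execute(
    "INSERT INTO sample VALUES (1, 'organism_A', 'field site 1', 'heat shock')"
)
conn.executemany(
    "INSERT INTO measurement (sample_id, method, data_file) VALUES (?, ?, ?)",
    [(1, "microarray", "array_001.txt"), (1, "mass spectrometry", "ms_001.raw")],
)

# Every data file can be traced back to the origin of its sample.
rows = conn.execute("""
SELECT s.origin, s.treatment, m.method, m.data_file
FROM measurement AS m
JOIN sample AS s ON s.sample_id = m.sample_id
ORDER BY m.measurement_id
""").fetchall()

# A record missing required information is rejected at entry time.
try:
    conn.execute("INSERT INTO sample VALUES (2, 'organism_B', 'field site 2', NULL)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```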

17
Comments from the Survey
  • "It will help me a lot if someone will offer a
    short seminar on data standards."
  • "Data standards are of more interest to computer
    scientists than to biological scientists."
  • "This is all Greek to me, which is exactly why
    very little to nothing is being developed that is
    useful to biologists like me."

18
Difficulties in GTL Data Management
  • Heterogeneous data. Metadata. Uncertainty.
  • Lack of data standards. (Love/hate
    relationship.)
  • Variety of DBMS being used.
  • Variety of instrument output formats.
  • Different DM phases with respect to data
    generation, analyses, and publication.
  • Human factors: lab notebook -> electronic format
    (potential loss of information), data
    rearrangement in spreadsheets.
  • Data attribution.

19
Overall Recommendations
  • GTL Program
      • Establish data standards and facilitate
        implementation. Data standards MUST be
        compatible with formats required by journals.
      • Establish project-wide schema for organism/gene
        based database(s) to facilitate integration.
      • Address data conversion problem.
  • DOE: Require description of data management
    plan as part of proposal. (Currently being
    done?)
  • Investigate digital notepad technology?
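
To illustrate why a project-wide organism/gene key would facilitate integration, the sketch below merges two data sets that share such a key. The function name, the data set names, and the values are hypothetical; a real integration would also have to reconcile units, identifiers, and metadata.

```python
from collections import defaultdict

def integrate(*datasets):
    """Merge (name, records) pairs whose records are keyed by (organism, gene)."""
    merged = defaultdict(dict)
    for name, records in datasets:
        for key, value in records.items():
            merged[key][name] = value
    return dict(merged)

# Hypothetical measurements from two different laboratory methods.
transcript = {("organism_A", "gene1"): 2.5, ("organism_A", "gene2"): 0.4}
protein = {("organism_A", "gene1"): 130.0}

combined = integrate(("transcript_level", transcript), ("protein_level", protein))
```

Genes measured by only one method, such as gene2 here, remain visible with a partial record, which makes gaps in coverage easy to spot.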

20
Acknowledgements
Carol Giometti, Argonne National Lab
Frank Olken, Lawrence Berkeley National Laboratory
Nancy Slater, GTL Project Manager, Lawrence Berkeley National Laboratory