Ronald Taylor, Ph.D. - PowerPoint PPT Presentation

Slides: 28
Provided by: iscb
Learn more at: https://www.iscb.org

Transcript and Presenter's Notes

1

The Network Inference Problem and the SEBINI
Platform
  • Ronald Taylor, Ph.D.
  • Computational Biology & Bioinformatics Group
  • Computational Sciences & Mathematics Division
  • Pacific Northwest National Laboratory (PNNL)
  • Richland, Washington
  • Email: ronald.taylor_at_pnl.gov
  • Systems Biology at PNNL: http://www.sysbio.org/

2
DOE's Genomics:GTL program
  • Follow-up to the Human Genome Project. (DOE
    launched the HGP in 1986. GenBank started at a
    DOE lab.)
  • Goal: to comprehensively understand cellular
    processes in a realistic context, i.e., systems
    biology. To be accomplished using high-throughput
    advanced technologies and computation (petabyte
    scale databases, integrated knowledgebases,
    network modeling)
  • Under the direction of the DOE Office of Science
    and its suboffices, the Office of Biological and
    Environmental Research (OBER) and the Office of
    Advanced Scientific Computing Research (OASCR).
  • Focused on microbial organisms. (Genomes
    sequenced in the DOE Microbial Genome Program.)

3
DOE Genomics:GTL Program Goals - The Role of
Living Systems in Energy Production,
Environmental Remediation, and Carbon Cycling and
Sequestration
  • Molecular: Proteins and multicomponent molecular
    machines that perform most of the cell's work
  • Cellular: Gene regulatory networks and pathways
    that control cellular processes
  • Community: Microbial communities in which groups
    of cells carry out complex processes in nature

4
Methods used to provide various data for
inferring regulatory networks (I)
  • Prediction of transcription factor (TF) binding
    sites, e.g., Dr. Lee Ann McCue's work at PNNL;
    also MotifMogul software at ISB.
  • Public and commercial databases (for example,
    TRANSFAC, SCPD), gradually collecting wet lab
    experiments that identify TFs and their binding
    sites, one by one.
  • Tiling arrays (expensive).
  • Projects to find protein-protein interactions,
    e.g., PNNL/ORNL GTL project (expensive).
  • Computational algorithms that infer regulatory
    edges based (primarily) on correlations in state:
    the algorithms used in SEBINI. These require a
    large amount of array or protein expression data.
    More powerful algorithms/models are needed to
    infer specific gene-to-gene connections than the
    typical statistical techniques used for
    clustering.

5
Methods used to provide various data for
inferring regulatory networks (II)
  • There are drawbacks to all methods. TFBS
    prediction based on sequence and phylogenetic
    comparison is very hard. Tiling arrays are
    promising, but new (and expensive: they test one
    putative source TF at a time). A drawback to
    both: dependence on nearness of the TFBS to a
    gene to infer the target. In eukaryotes, 70% of
    TFs bind far from their targets (A. Aderem).
    Also, neither yields the regulation type
    (activator / inhibitor), just that there is
    binding. It is also common in bacteria for a TF
    to lie between genes transcribed in different
    directions. It may regulate one or both; which
    one cannot be determined from tiling and TFBS
    prediction.
  • As for determining interaction networks using
    mass spec: not yet high-throughput, in terms of
    results.

6
Conclusion
  • Computational algorithms that are based on
    correlations in state will continue to be
    used, remaining a standard approach for many
    years to come.
  • Bonus: gathering the large amount of array data
    required for their use provides the raw data for
    investigation of state functions, a topic for
    future research.

7
Software Environment for BIological Network
Inference (SEBINI) - Introduction
SEBINI has been created to provide an interactive
environment for the evaluation and deployment of
algorithms used in the reconstruction of the
structure of biological regulatory networks.
SEBINI compares and trains network inference
methods on artificial networks and simulated gene
expression perturbation data. It also allows the
analysis within the same framework of
experimental high-throughput expression data
using the suite of (trained) inference methods.
Hence SEBINI should be useful both to software
developers wishing to evaluate, compare, refine,
or combine inference techniques, and to
bioinformaticians (or biologists) analyzing
experimental data. SEBINI provides a platform
that aids in more accurate reconstruction of
regulatory and interaction networks, with much
less effort, in less time.
10
Software Environment for BIological Network
Inference (SEBINI)
  • PNNL's Bioinformatics Resource Manager (BRM)
  • Input Module: high-throughput experimental data
  • Builder Module: simulated high-throughput
    expression data for artificial networks
  • Text files (flat files)
  • Visualization of inferred networks via Cytoscape
  • PNNL's PRISM database system
  • Human-readable reports on inferred networks
  • SEBINI central relational database (PostgreSQL)
  • User interface: web site operated by Java
    servlets
  • Machine-readable network structure files for
    dynamic modeling programs
  • Topological statistics, network annotation,
    post-inference processing; scoring and error
    analysis (on artificial data sets)
  • Collection of network inference algorithms. User
    selects algorithm and data set, runs alg to infer
    a network (a set of edges). Mutual
    information-based and Bayesian network structure
    learning algorithms provided for learning
    regulatory networks. Also a PNNL/ORNL algorithm
    for learning protein-protein interaction networks
    from PNNL/ORNL bait-prey experiment mass spec
    data sets. Inferred networks permanently stored
    back into database.
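As a rough illustration of what a mutual information-based inference step might look like, the sketch below scores every gene pair on binned (discrete) expression states and keeps pairs at or above a cutoff. The function names, the cutoff, and the toy data are assumptions made for illustration; this is not SEBINI's actual algorithm, and pairwise MI yields undirected edges only.

```python
import math
from collections import Counter
from itertools import combinations


def mutual_information(x, y):
    """Mutual information (in bits) between two equal-length discrete sequences."""
    n = len(x)
    count_x = Counter(x)
    count_y = Counter(y)
    count_xy = Counter(zip(x, y))
    mi = 0.0
    for (a, b), count in count_xy.items():
        p_ab = count / n
        mi += p_ab * math.log2(p_ab / ((count_x[a] / n) * (count_y[b] / n)))
    return mi


def infer_edges(binned, cutoff):
    """Return {(gene1, gene2): score} for every pair whose MI reaches the cutoff."""
    edges = {}
    for g1, g2 in combinations(sorted(binned), 2):
        score = mutual_information(binned[g1], binned[g2])
        if score >= cutoff:
            edges[(g1, g2)] = score
    return edges


# Hypothetical binned data: one list of discrete states per gene, one entry per array.
binned = {
    "tf1":   [0, 0, 1, 1, 0, 0, 1, 1],
    "gene2": [0, 0, 1, 1, 0, 0, 1, 1],  # tracks tf1 exactly
    "gene3": [0, 1, 0, 1, 0, 1, 0, 1],  # statistically independent of tf1
}
edges = infer_edges(binned, cutoff=0.5)
```

With this toy data only the tf1/gene2 pair survives the cutoff. Choosing an appropriate cutoff is one of the things the slides describe as needing empirical work.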
11
SEBINI architecture implementation (I)
  • 100 Java programs (classes) and growing
    rapidly.
  • All inter-servlet communication is routed through
    a CentralControl class. Algorithm handlers are
    called directly from the Java servlets for the
    corresponding web pages. This environment is
    NOT a spiderweb: there is a control chokepoint.
  • 30 PostgreSQL database tables. Slowly growing;
    at present quite stable. One major database
    change coming.
  • Data security: project based. Upon login, the
    user is assigned a 32-digit hex JSessionID,
    which is checked before display of every web
    page.
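The project-based session check can be sketched as follows. This is a minimal Python illustration of the idea only (issue a 32-hex-digit ID at login, verify it before rendering any page); the real system relies on Java servlets, and the class and function names here are hypothetical.

```python
import re
import secrets

SESSION_ID_PATTERN = re.compile(r"[0-9a-f]{32}")


class SessionRegistry:
    """Maps session IDs to the project the user logged into (hypothetical helper)."""

    def __init__(self):
        self._project_by_session = {}

    def login(self, project):
        session_id = secrets.token_hex(16)  # 16 random bytes -> 32 hex digits
        self._project_by_session[session_id] = project
        return session_id

    def project_for(self, session_id):
        """Return the project bound to this session, or None if the ID is invalid."""
        if not isinstance(session_id, str):
            return None
        if not SESSION_ID_PATTERN.fullmatch(session_id):
            return None
        return self._project_by_session.get(session_id)


def render_page(registry, session_id, page_project):
    """Check the session before every page display; refuse on any mismatch."""
    project = registry.project_for(session_id)
    if project is None or project != page_project:
        return "403 Forbidden"
    return f"200 OK: {page_project}"
```

Routing every page through one check of this kind mirrors the "control chokepoint" design described above.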

12
SEBINI architecture implementation (II)
  • While SEBINI was originally designed to
    infer directed (regulatory) networks, the code
    now allows undirected networks, so algorithms
    that infer interaction networks can be used and
    such networks (e.g., protein-protein interaction
    networks) permanently stored and analyzed.
  • Design issues: interface for user navigation
    among huge data sets; database design to map
    inferred networks and inferred edges back to
    original network and expression data. IDs must
    be carried forward.
  • expression data → one-to-many via binning alg
    choice → binned exp data → one-to-many via
    inference alg choice → inferred_network
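The chain of one-to-many relations, with IDs carried forward, could be modeled roughly like this. The record and field names are assumptions for illustration and are not taken from the SEBINI schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExpressionSet:
    id: int
    name: str


@dataclass(frozen=True)
class BinnedSet:
    id: int
    expression_set_id: int  # carried forward from the raw expression data
    binning_alg: str


@dataclass(frozen=True)
class InferredNetwork:
    id: int
    binned_set_id: int  # carried forward from the binned data
    inference_alg: str


def trace_lineage(network, binned_sets, expression_sets):
    """Walk from an inferred network back to the data it was derived from."""
    binned = binned_sets[network.binned_set_id]
    expression = expression_sets[binned.expression_set_id]
    return expression, binned, network


# One expression set, binned two ways; one binned set used by two algorithms.
expression_sets = {1: ExpressionSet(1, "toy arrays")}
binned_sets = {
    10: BinnedSet(10, 1, "equal-width"),
    11: BinnedSet(11, 1, "equal-frequency"),
}
networks = [
    InferredNetwork(100, 10, "mutual-information"),
    InferredNetwork(101, 10, "bayesian-structure-learning"),
]
lineage = trace_lineage(networks[1], binned_sets, expression_sets)
```

Because each record stores only its parent's ID, any inferred network can be traced back to its binning choice and its original expression data.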

13
SEBINI architecture implementation (III)
  • Design issues (continued): multi-threaded job
    monitoring; web pages to view all data points,
    working towards transparency of all data in all
    tables (security access permitting) in the
    database via display through the web site.
  • Permanent storage of binned/pre-processed data
    sets. Novel. Important for efficiency,
    transparency, speed of response, analysis of
    results.
  • Job times recorded to the millisecond.
    Algorithms can be compared on efficiency vs.
    relative power.
  • A Java handler class is created for each new
    algorithm, to wrap it, i.e., to handle
    communication with the database and web site.
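The handler-per-algorithm pattern, together with millisecond job timing, might look like the sketch below. It is written in Python rather than Java, a plain list stands in for the database, and the toy algorithm and all names are hypothetical.

```python
import time
from abc import ABC, abstractmethod


class AlgorithmHandler(ABC):
    """Wraps one inference algorithm; subclasses supply infer()."""

    name = "unnamed"

    @abstractmethod
    def infer(self, binned, params):
        """Return a set of (gene1, gene2) edges."""

    def run_job(self, binned, params, job_log):
        """Run the wrapped algorithm and record its wall-clock time in milliseconds."""
        start = time.perf_counter()
        edges = self.infer(binned, params)
        elapsed_ms = round((time.perf_counter() - start) * 1000)
        # job_log stands in for the database table that stores job records.
        job_log.append(
            {"algorithm": self.name, "elapsed_ms": elapsed_ms, "edge_count": len(edges)}
        )
        return edges


class IdenticalProfileHandler(AlgorithmHandler):
    """Toy algorithm: link two genes when their binned profiles are identical."""

    name = "identical-profile"

    def infer(self, binned, params):
        genes = sorted(binned)
        return {
            (a, b)
            for i, a in enumerate(genes)
            for b in genes[i + 1:]
            if binned[a] == binned[b]
        }
```

Keeping timing and storage in the shared base class means every new algorithm is measured the same way, which is what makes the efficiency comparisons meaningful.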

14
SEBINI architecture implementation (IV)
  • SEBINI was initially implemented on a Dell
    desktop running Red Hat Linux, using Java ver.
    1.4, PostgreSQL ver. 7.4, and Tomcat 4.1.
    Cytoscape is used for network visualization,
    invoked through Java Web Start.
  • SEBINI has also been installed on a Windows web
    server that will soon be accessible from outside
    PNNL, using Java ver. 1.5, Tomcat ver. 5, and
    PostgreSQL 8.1. Jakarta Commons Java libraries
    are used for data file uploads.
  • Machine-specific parameters are stored in an
    easily changed properties text file.
  • Development site:
    http://asimov.emsl.pnl.gov:8080/NIT/NIT.html
  • Public demo site:
    https://www.emsl.pnl.gov/NIT/NIT.html

15
Possible sources of inference algorithms
  • Probabilistic graphical models (structure
    learning: Bayesian networks, among others)
  • Information theory (mutual information based, and
    CMI)
  • Classical statistics, analysis of correlation -
    e.g., Pearson correlation
  • Machine learning: decision trees (C4.5, ID3),
    supervised and unsupervised
  • Data mining: association rule mining (really
    want to try this)
  • Pattern classification
  • Deductive reasoning
  • Neural networks
  • Fuzzy logic?
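To make the simplest entry in this list concrete, here is a minimal sketch of correlation-based edge inference using Pearson correlation with an absolute-value threshold. The threshold and the toy expression values are assumptions, and a baseline like this is a point of comparison rather than a recommended method.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile carries no correlation signal
    return cov / math.sqrt(var_x * var_y)


def correlation_edges(expression, threshold=0.9):
    """Return undirected edges between genes whose |r| reaches the threshold."""
    edges = set()
    for g1, g2 in combinations(sorted(expression), 2):
        if abs(pearson(expression[g1], expression[g2])) >= threshold:
            edges.add((g1, g2))
    return edges


# Hypothetical expression levels across five arrays.
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.1, 3.9, 6.2, 8.0, 9.9],  # rises with geneA
    "geneC": [5.0, 1.0, 4.0, 2.0, 3.0],  # weakly related to both
}
edges = correlation_edges(expression)
```

Here only geneA and geneB are linked. Correlation alone gives neither edge direction nor regulation type, which is part of the case for the more powerful methods listed above.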

22
SEBINI flow of control, for experimental data
  • Log into a project, or create a new project
  • Create a network set (container) for the one
    experimental network
  • Upload the experimental expression data file
  • Select a binning algorithm and bin the data
  • Select an inference algorithm, select the alg
    parameters, and infer a network.
  • Visualize the inferred network, now in the
    database, using Cytoscape
  • View topological statistics
  • View node and edge annotations
  • Generate a human-readable report
  • Export the topology in a format suitable for
    input into dynamic simulations
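The binning step in this workflow turns continuous expression values into a small number of discrete states. The slides do not say which binning algorithms SEBINI offers, so the equal-width scheme below is only an assumed example.

```python
def equal_width_bins(values, n_bins=3):
    """Map each value to a bin index 0..n_bins-1 using equal-width intervals."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)  # a flat profile falls into a single bin
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it into the top bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def bin_expression_set(expression, n_bins=3):
    """Bin every gene's profile independently."""
    return {gene: equal_width_bins(profile, n_bins) for gene, profile in expression.items()}


profile = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
states = equal_width_bins(profile, n_bins=3)  # [0, 0, 1, 1, 2, 2, 2]
```

Storing the binned result permanently, as the architecture slides describe, lets several inference algorithms run on the same discretization without repeating this step.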

23
SEBINI flow of control, for synthetic data
  • Log into a project, or create a new project
  • Select a topology build algorithm, enter param
    values, and create a synthetic network set
  • Select an expression set build algorithm, enter
    param values, and create synthetic expression
    sets
  • Select a binning algorithm and bin the data
  • Select an inference algorithm, enter param
    values, and infer network
  • Visualize and compare the real and inferred
    network(s), using Cytoscape
  • View topological statistics
  • View precision, recall, and F-measure statistics
    for a precise measure of how well the inference
    alg performed against the gold standard, the
    known synthetic network.
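Scoring an inferred network against the known synthetic network reduces to set comparisons over edges. The sketch below assumes directed edges stored as (source, target) tuples; the function name and toy networks are illustrative.

```python
def score_inference(true_edges, inferred_edges):
    """Return (precision, recall, F-measure) of inferred edges vs. the gold standard."""
    true_edges = set(true_edges)
    inferred_edges = set(inferred_edges)
    true_positives = len(true_edges & inferred_edges)
    precision = true_positives / len(inferred_edges) if inferred_edges else 0.0
    recall = true_positives / len(true_edges) if true_edges else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Gold standard: the known synthetic network (directed edges).
gold = {("A", "B"), ("B", "C"), ("C", "D"), ("A", "D")}
# Inferred: two correct edges and one false positive.
inferred = {("A", "B"), ("B", "C"), ("B", "D")}
precision, recall, f_measure = score_inference(gold, inferred)
```

In this example precision is 2/3, recall is 1/2, and the F-measure is 4/7. For undirected networks, each edge would need a canonical form (for example a sorted tuple) before comparison.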

24
Some goals for SEBINI
  • Make network inference a starting point, not an
    end point (currently an end point that is usually
    not even reached). Simple deployment of
    state-of-the-art algs not previously available to
    a biology lab, available over the web or via
    local SEBINI install.
  • Advance the field by improving the algorithms.
    Possibility of combining alg output (as done in
    GRAIL), now that alg results are stored in same
    database. Develop expertise on how much data is
    needed, appropriate cutoffs, species-specific
    post-processing, the weaknesses of a given
    method, what background information on a genome
    is most useful to supplement the primary
    expression data.
  • Network biology is only in its infancy (2004,
    Barabasi). Nobody knows what inference
    algorithm(s) will perform best - theoretical
    guidance is lacking. But SEBINI will position us
    to empirically test new algorithms, easily modify
    or combine algs - possibly with species-specific
    information.

25
DOE Science Undergraduate Laboratory Internship
(SULI) program
  • DOE Science Undergraduate Lab Internship (SULI)
    program. Year-round. Duration: 10 wks in summer,
    12-16 wks in fall or spring. $400/week plus
    housing.
    http://science-ed.pnl.gov/undergrad/erulf.stm
  • Manager: Karen Wieda, kj.wieda_at_pnl.gov, (509)
    375-3811.
  • Of 200 applications specifying PNNL as 1st round
    choice, 50 offers made last year. Summer term is
    the most competitive. DOE pays most of the cost.
  • PNNL Science & Engineering Education - Fellowship
    Services. Undergrad, grad, visiting scientists,
    sabbaticals. $400-750/week, 5 months max.
    http://science-ed.pnl.gov/studentops.stm
  • Manager: Rebecca Janosky,
    rebecca.janosky_at_pnl.gov, (509) 375-2302.
  • About 900-1000 applications/yr, of which 250 get
    an offer. Completed applications posted to
    central site. Mentor/host pays the cost, plus 18%
    overhead.

26
Take-home SULI information
  • Apply by early January 2007 for summer 2007. PNNL
    sees students who selected it as first choice on
    Feb 1.
  • SULI manager: Ms. Karen Wieda,
    karen.wieda_at_pnl.gov, phone (509) 375-3811
  • Ronald Taylor (for the SEBINI project):
    ronald.taylor_at_pnl.gov
  • PNNL science education web site:
    http://science-ed.pnl.gov/students/
  • Other background web sites: www.pnl.gov,
    www.sysbio.org, genomicsgtl.energy.gov/compbio/

27
Acknowledgements
  • For work on SEBINI: Anuj Shah (for work on
    Cytoscape, MI alg translation, statistics),
    Meridith Blevins (SULI), Charles Treatman (SULI)
  • Funding from the US Dept of Energy, through the
    PNNL Biomolecular Systems Initiative, the EMSL
    Membrane Grand Challenge at PNNL, and the joint
    PNNL/ORNL protein-protein interaction network
    mapping GTL project