Ronald Taylor, Ph.D. - PowerPoint PPT Presentation

About This Presentation

Title:

Ronald Taylor, Ph.D.

Description:

Systems Biology at PNNL: http://www.sysbio.org ... Follow-up to the Human Genome Project. (DOE launched the HGP in 1986. GenBank started at a DOE lab. ...

Slides: 28

Provided by: iscb

Learn more at: https://www.iscb.org

Transcript and Presenter's Notes

1

The Network Inference Problem and the SEBINI
Platform

Ronald Taylor, Ph.D.
Computational Biology and Bioinformatics Group
Computational Sciences and Mathematics Division
Pacific Northwest National Laboratory (PNNL)
Richland, Washington
Email: ronald.taylor_at_pnl.gov
Systems Biology at PNNL: http://www.sysbio.org/

2
DOE's Genomics:GTL program

Follow-up to the Human Genome Project. (DOE
launched the HGP in 1986. GenBank started at a
DOE lab.)
Goal: to comprehensively understand cellular
processes in a realistic context, i.e., systems
biology. To be accomplished using high-throughput
advanced technologies and computation
(petabyte-scale databases, integrated
knowledgebases, network modeling).
Under the direction of the DOE Office of Science
and its suboffices, the Office of Biological and
Environmental Research (OBER) and the Office of
Advanced Scientific Computing Research (OASCR).
Focused on microbial organisms. (Genomes
sequenced in the DOE Microbial Genome Program.)

3
DOE Genomics:GTL Program Goals - The Role of
Living Systems in Energy Production,
Environmental Remediation, and Carbon Cycling and
Sequestration

Molecular: Proteins and multicomponent molecular
machines that perform most of the cell's work
Cellular: Gene regulatory networks and pathways
that control cellular processes
Community: Microbial communities in which groups
of cells carry out complex processes in nature

4
Methods used to provide various data for
inferring regulatory networks (I)

Prediction of transcription factor (TF) binding
sites, e.g., Dr. Lee Ann McCue's work at PNNL;
also the MotifMogul software at ISB.
Public and commercial databases (for example,
TRANSFAC, SCPD), gradually collecting wet lab
experiments that identify TFs and their binding
sites, one by one.
Tiling arrays (expensive).
Projects to find protein-protein interactions,
e.g., the PNNL/ORNL GTL project (expensive).
Computational algorithms that infer regulatory
edges based (primarily) on correlations in state:
the algorithms used in SEBINI. These require a
large amount of array or protein expression data.
More powerful algorithms/models are needed to
infer specific gene-to-gene connections than the
typical statistical techniques used for
clustering.
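
As a minimal sketch of what "inferring edges from correlations in state" can mean, the example below thresholds pairwise Pearson correlation between expression profiles. This is only the simplest illustration, not one of the algorithms named in the talk: the gene names, the data, and the 0.9 cutoff are all hypothetical, and plain correlation yields undirected candidate edges with no regulation type.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile carries no correlation signal
    return cov / math.sqrt(var_x * var_y)

def infer_edges(expression, threshold=0.9):
    """Return undirected candidate edges whose |r| meets the threshold."""
    edges = []
    for g1, g2 in combinations(sorted(expression), 2):
        r = pearson(expression[g1], expression[g2])
        if abs(r) >= threshold:
            edges.append((g1, g2, round(r, 3)))
    return edges

# Hypothetical expression profiles: one value per experimental condition.
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.1, 3.9, 6.2, 8.1, 9.8],  # rises with geneA
    "geneC": [5.0, 1.0, 4.0, 2.0, 3.0],  # no clear relation to either
}
edges = infer_edges(expression)
```

With these made-up profiles only the geneA/geneB pair passes the cutoff. With few conditions, many unrelated gene pairs will pass such a cutoff by chance, which is one reason these methods need a large amount of expression data.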

5
Methods used to provide various data for
inferring regulatory networks (II)

There are drawbacks to all methods. TFBS
prediction based on sequence and phylogenetic
comparison is very hard. Tiling arrays are
promising, but new (expensive: they test one
putative source TF at a time). A drawback to
both: dependence on nearness of the TFBS to a
gene to infer the target. In eukaryotes, 70% of
TFs bind far from their targets (A. Aderem).
Also, neither yields the regulation type
(activator / inhibitor), just that there is
binding. It is also common in bacteria for a TF
binding site to lie between genes transcribed in
different directions. The TF may regulate one or
both; which is unknown from tiling and TFBS
prediction.
As for determining interaction networks using
mass spec: not yet high-throughput, in terms of
results.

6
Conclusion

Computational algorithms that are based on
correlations in state will continue to be used,
remaining a standard approach for many years to
come.
Bonus: gathering the large amount of array data
required for their use provides the raw data for
investigation of state functions, a topic for
future research.

7
Software Environment for BIological Network
Inference (SEBINI) - Introduction
SEBINI has been created to provide an interactive
environment for the evaluation and deployment of
algorithms used in the reconstruction of the
structure of biological regulatory networks.
SEBINI compares and trains network inference
methods on artificial networks and simulated gene
expression perturbation data. It also allows the
analysis within the same framework of
experimental high-throughput expression data
using the suite of (trained) inference methods.
Hence SEBINI should be useful both to software
developers wishing to evaluate, compare, refine,
or combine inference techniques, and to
bioinformaticians (or biologists) analyzing
experimental data. SEBINI provides a platform
that aids in more accurate reconstruction of
regulatory and interaction networks, with much
less effort, in less time.
10
Software Environment for BIological Network
Inference (SEBINI)
PNNL's Bioinformatics Resource Manager (BRM)
Input Module: High-throughput experimental data
Builder Module: Simulated high-throughput
expression data for artificial networks
Text files (flat files)
Visualization of inferred networks via Cytoscape
PNNL's PRISM database system
Human-readable reports on inferred networks
SEBINI central relational database (PostgreSQL)
User interface: web site operated by Java
servlets
Machine-readable network structure files for
dynamic modeling programs
Topological statistics, network annotation,
post-inference processing, scoring, error
analysis (on artificial data sets)
Collection of network inference algorithms. User
selects algorithm and data set, runs alg to infer
a network (a set of edges). Mutual
information-based and Bayesian network structure
learning algorithms provided for learning
regulatory networks. Also a PNNL/ORNL algorithm
for learning protein-protein interaction networks
from PNNL/ORNL bait-prey experiment mass spec
data sets. Inferred networks permanently stored
back into database.
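
A small sketch of handing an inferred edge set to Cytoscape follows. It writes Cytoscape's simple interaction format (SIF), in which each line holds a source node, an interaction type, and a target node. The slides do not say which file format SEBINI actually passes to Cytoscape, so the choice of SIF, the "regulates" label, and the node names are assumptions made for illustration.

```python
def to_sif(edges, interaction="regulates"):
    """Render (source, target) pairs as tab-delimited SIF lines."""
    return "\n".join(
        f"{source}\t{interaction}\t{target}" for source, target in edges
    )

# Hypothetical inferred network: a set of directed edges.
edges = [("tfA", "gene1"), ("tfA", "gene2"), ("gene2", "gene3")]
sif_text = to_sif(edges)
```

The resulting text can be saved to a `.sif` file and imported as a network in Cytoscape.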
11
SEBINI architecture and implementation (I)

100 Java programs (classes) and growing
rapidly.
All inter-servlet communication is routed through
a CentralControl class. Algorithm handlers are
called directly from the Java servlets for the
corresponding web pages. This environment is NOT
a spiderweb: there is a control chokepoint.
30 PostgreSQL database tables. Slowly growing;
at present quite stable. One major database
change coming.
Data security: project based. Upon login, the
user is assigned a 32-hex-digit JSessionID,
which is checked before display of every web
page.
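
The per-page session check can be sketched as below. In the real system the servlet container issues the session ID. Here a hypothetical `SessionRegistry` class stands in for it, generating a random 32-hex-digit ID at login and validating it against the project before a page is shown. All names are illustrative.

```python
import re
import secrets

SESSION_ID_PATTERN = re.compile(r"[0-9A-F]{32}")

def new_session_id():
    """Generate a random 32-hex-digit session identifier."""
    return secrets.token_hex(16).upper()

class SessionRegistry:
    """Hypothetical stand-in for the container's session store."""

    def __init__(self):
        self._project_by_session = {}

    def login(self, project):
        """Bind a fresh session ID to the project the user logged into."""
        session_id = new_session_id()
        self._project_by_session[session_id] = project
        return session_id

    def check(self, session_id, project):
        """True only if the ID is well formed and bound to this project."""
        if not SESSION_ID_PATTERN.fullmatch(session_id or ""):
            return False
        return self._project_by_session.get(session_id) == project

registry = SessionRegistry()
session_id = registry.login("demo-project")
allowed = registry.check(session_id, "demo-project")    # same project
denied = registry.check(session_id, "other-project")    # different project
```

Calling `check` at the top of every page handler gives the "checked before display of every web page" behavior, and binding the ID to a project is what makes the security project based.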

12
SEBINI architecture and implementation (II)

While SEBINI was originally designed to infer
directed (regulatory) networks, the code now
allows undirected networks, so algorithms that
infer interaction networks can be used and such
networks (e.g., protein-protein interaction
networks) permanently stored and analyzed.
Design issues: interface for user navigation
among huge data sets; database design to map
inferred networks and inferred edges back to the
original network and expression data (IDs must
be carried forward).
expression data -> (one-to-many via binning alg
choice) -> binned exp data -> (one-to-many via
inference alg choice) -> inferred_network
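
The chain from expression data through binned data to inferred networks can be sketched with three record types, each carrying its parent's ID forward. The class and field names are hypothetical and do not reflect SEBINI's actual PostgreSQL schema. The point is only that every inferred network can be traced back to the data it came from.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ExpressionSet:
    set_id: int
    name: str

@dataclass(frozen=True)
class BinnedSet:
    set_id: int
    expression_set_id: int    # parent expression data
    binning_algorithm: str

@dataclass(frozen=True)
class InferredNetwork:
    network_id: int
    binned_set_id: int        # parent binned data
    inference_algorithm: str

def trace_lineage(network, binned_sets, expression_sets):
    """Follow the carried-forward IDs from a network back to its raw data."""
    binned = binned_sets[network.binned_set_id]
    expression = expression_sets[binned.expression_set_id]
    return expression, binned, network

# One expression set, binned two ways; three networks inferred from the bins.
expression_sets = {1: ExpressionSet(1, "example arrays")}
binned_sets = {
    10: BinnedSet(10, 1, "equal-width, 3 bins"),
    11: BinnedSet(11, 1, "equal-width, 5 bins"),
}
networks = [
    InferredNetwork(100, 10, "mutual information"),
    InferredNetwork(101, 10, "Bayesian structure learning"),
    InferredNetwork(102, 11, "mutual information"),
]
source, binned, _ = trace_lineage(networks[2], binned_sets, expression_sets)
```

Each arrow in the chain is one-to-many because a single parent record can be processed with several algorithm choices, each producing its own child record.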

13
SEBINI architecture and implementation (III)

Design issues (continued): multi-threaded job
monitoring; web pages to view all data points,
working towards transparency of all data in all
tables (security access permitting) in the
database via display through the web site.
Permanent storage of binned/pre-processed data
sets. Novel. Important for efficiency,
transparency, speed of response, analysis of
results.
Job times recorded to the millisecond. Algorithms
can be compared on efficiency vs relative power.
A Java handler class is created for each new
algorithm, to wrap it, i.e., to handle
communication with the database and web site.
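
The handler idea can be sketched as follows. SEBINI's handlers are Java classes; this Python analogue is an assumption-laden illustration in which a hypothetical `AlgorithmHandler` wraps any edge-inference function, times the job in milliseconds, and records the result in a list standing in for a database jobs table.

```python
import time

class AlgorithmHandler:
    """Hypothetical wrapper: runs one inference algorithm and logs the job."""

    def __init__(self, name, algorithm, job_log):
        self.name = name
        self.algorithm = algorithm    # callable: dataset -> list of edges
        self.job_log = job_log        # stand-in for a database jobs table

    def run(self, dataset, **params):
        """Run the wrapped algorithm and record its wall-clock time."""
        start = time.perf_counter()
        edges = self.algorithm(dataset, **params)
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        self.job_log.append({
            "algorithm": self.name,
            "params": params,
            "edge_count": len(edges),
            "elapsed_ms": round(elapsed_ms, 3),
        })
        return edges

def chain_edges(dataset):
    """Toy algorithm: link each gene to the next one in sorted order."""
    genes = sorted(dataset)
    return list(zip(genes, genes[1:]))

job_log = []
handler = AlgorithmHandler("toy-chain", chain_edges, job_log)
edges = handler.run({"geneA": [0, 1], "geneB": [1, 1], "geneC": [1, 0]})
```

Because every algorithm goes through the same wrapper, the recorded times are comparable across algorithms, which is what allows a comparison of efficiency against relative power.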

14
SEBINI architecture and implementation (IV)

SEBINI was initially implemented on a Dell
desktop running Red Hat Linux, using Java ver.
1.4, PostgreSQL ver. 7.4, and Tomcat 4.1.
Cytoscape is used for network visualization,
invoked through Java Web Start.
SEBINI has also been installed on a Windows web
server that will soon be accessible from outside
PNNL, using Java ver. 1.5, Tomcat ver. 5, and
PostgreSQL 8.1. Jakarta Commons Java libraries
are used for data file uploads.
Machine-specific parameters are stored in an
easily changed properties text file.
development site: http://asimov.emsl.pnl.gov:8080/NIT/NIT.html
public demo site: https://www.emsl.pnl.gov/NIT/NIT.html

15
Possible sources of inference algorithms

Probabilistic graphical models (structure
learning: Bayesian networks, among others)
Information theory (mutual information-based, and
conditional mutual information, CMI)
Classical statistics, analysis of correlation,
e.g., Pearson correlation
Machine learning: decision trees (C4.5, ID3),
supervised and unsupervised
Data mining: association rule mining (really
want to try this)
Pattern classification
Deductive reasoning
Neural networks
Fuzzy logic?
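
Of the sources listed, the information-theoretic one is easy to show concretely. The sketch below computes the mutual information, in bits, between two binned (discrete) expression profiles using the standard definition I(X;Y) = sum over (x, y) of p(x,y) log2( p(x,y) / (p(x) p(y)) ). It illustrates the quantity only and is not the specific MI algorithm shipped in SEBINI. The example profiles are made up.

```python
import math
from collections import Counter

def mutual_information(x, y):
    """Mutual information (bits) between two equal-length discrete sequences."""
    n = len(x)
    count_x = Counter(x)
    count_y = Counter(y)
    count_xy = Counter(zip(x, y))
    mi = 0.0
    for (a, b), c in count_xy.items():
        p_xy = c / n
        # p_xy / (p_x * p_y) written with raw counts: c * n / (count_x * count_y)
        mi += p_xy * math.log2(c * n / (count_x[a] * count_y[b]))
    return mi

# Binned profiles over four conditions (0 = low, 1 = high).
identical = mutual_information([0, 0, 1, 1], [0, 0, 1, 1])
independent = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
```

Here `identical` comes out to one full bit and `independent` to zero. Unlike Pearson correlation, mutual information also picks up non-linear dependence, at the cost of requiring the data to be binned first.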

22
SEBINI flow of control, for experimental data

Log into a project, or create a new project
Create a network set (container) for the one
experimental network
Upload the experimental expression data file
Select a binning algorithm and bin the data
Select an inference algorithm, select the alg
parameters, and infer a network.
Visualize the inferred network, now in the
database, using Cytoscape
View topological statistics
View node and edge annotations
Generate a human-readable report
Export the topology in a format suitable for
input into dynamic simulations
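
The binning step in this flow can be sketched as below, assuming simple equal-width binning. The slides do not say which binning algorithms SEBINI offers, so the method, the function name, and the sample values are illustrative only.

```python
def equal_width_bins(values, n_bins=3):
    """Map continuous expression values to integer bin labels 0..n_bins-1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)  # a flat profile falls into a single bin
    width = (hi - lo) / n_bins
    # The maximum value lands exactly on the upper edge; clamp it into the top bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

# Hypothetical expression profile for one gene across six conditions.
profile = [0.0, 1.0, 2.0, 3.0, 4.0, 6.0]
binned = equal_width_bins(profile)  # -> [0, 0, 1, 1, 2, 2]
```

The discrete labels are what a mutual information or Bayesian structure learning algorithm would then consume. The number of bins is itself a parameter worth varying, since it changes what the inference step can detect.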

23
SEBINI flow of control, for synthetic data

Log into a project, or create a new project
Select a topology build algorithm, enter param
values, and create a synthetic network set
Select an expression set build algorithm, enter
param values, and create synthetic expression
sets
Select a binning algorithm and bin the data
Select an inference algorithm, enter param
values, and infer network
Visualize and compare the real and inferred
network(s), using Cytoscape
View topological statistics
View precision, recall, F-measure statistics for
a precise measure of how well the inference alg
performed against the gold standard, the known
synthetic network.
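
The scoring step can be sketched with the standard definitions: precision is the fraction of inferred edges that are in the true network, recall is the fraction of true edges that were inferred, and the F-measure is their harmonic mean. The edge sets below are made up, and edges are treated as directed pairs.

```python
def score_edges(true_edges, inferred_edges):
    """Return (precision, recall, F-measure) of inferred edges vs the gold standard."""
    true_set = set(true_edges)
    inferred_set = set(inferred_edges)
    true_positives = len(true_set & inferred_set)
    precision = true_positives / len(inferred_set) if inferred_set else 0.0
    recall = true_positives / len(true_set) if true_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Known synthetic network vs. the network an algorithm inferred from its data.
true_edges = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "D")]
inferred_edges = [("A", "B"), ("B", "C"), ("A", "C")]
precision, recall, f_measure = score_edges(true_edges, inferred_edges)
```

In this example two of the three inferred edges are correct (precision 2/3) and two of the four true edges are recovered (recall 1/2), giving an F-measure of 4/7. For undirected interaction networks, each edge would need to be normalized (for instance, sorted) before comparison.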

24
Some goals for SEBINI

Make network inference a starting point, not an
end point (currently an end point that is usually
not even reached). Simple deployment of
state-of-the-art algs not previously available to
a biology lab, available over the web or via
local SEBINI install.
Advance the field by improving the algorithms.
Possibility of combining alg output (as done in
GRAIL), now that alg results are stored in same
database. Develop expertise on how much data is
needed, appropriate cutoffs, species-specific
post-processing, the weaknesses of a given
method, what background information on a genome
is most useful to supplement the primary
expression data.
Network biology is only in its infancy (2004,
Barabasi). Nobody knows what inference
algorithm(s) will perform best - theoretical
guidance is lacking. But SEBINI will position us
to empirically test new algorithms, easily modify
or combine algs - possibly with species specific
information.

25
DOE Science Undergraduate Laboratory Internship
(SULI) program

DOE Science Undergraduate Lab Internship (SULI)
program. Year-round. Duration: 10 wks summer,
12-16 wks in fall or spring. $400/week housing
http://science-ed.pnl.gov/undergrad/erulf.stm
Manager: Karen Wieda, kj.wieda_at_pnl.gov, (509)
375-3811.
Of 200 applications specifying PNNL as 1st round
choice, 50 offers made last year. Summer term is
the most competitive. DOE pays most of cost.
PNNL Science and Engineering Education -
Fellowship Services. Undergrad, grad, visiting
scientists, sabbaticals. $400-750/week, 5 months
max.
http://science-ed.pnl.gov/studentops.stm
Manager: Rebecca Janosky,
rebecca.janosky_at_pnl.gov, (509) 375-2302.
About 900-1000 applications/yr, of which 250 get
an offer. Completed applications posted to
central site. Mentor/host pays the cost, plus 18%
overhead.

26
Take-home SULI information

Apply by early January 2007 for summer 2007. PNNL
sees students who selected it as first choice on
Feb 1.
SULI manager: Ms. Karen Wieda, karen.wieda_at_pnl.gov
Phone: (509) 375-3811
Ronald Taylor (for the SEBINI project):
ronald.taylor_at_pnl.gov
PNNL science education web site:
http://science-ed.pnl.gov/students/
Other background web sites: www.pnl.gov,
www.sysbio.org, genomicsgtl.energy.gov/compbio/

27
Acknowledgements

For work on SEBINI: Anuj Shah (for work on
Cytoscape, MI alg translation, statistics),
Meridith Blevins (SULI), Charles Treatman (SULI)
Funding from the US Dept of Energy, through the
PNNL Biomolecular Systems Initiative, the EMSL
Membrane Grand Challenge at PNNL, and the joint
PNNL/ORNL protein-protein interaction network
mapping GTL project
