Title: Ronald Taylor, Ph.D.
1 The Network Inference Problem and the SEBINI
Platform
- Ronald Taylor, Ph.D.
- Computational Biology and Bioinformatics Group
- Computational Sciences and Mathematics Division
- Pacific Northwest National Laboratory (PNNL)
- Richland, Washington
- Email: ronald.taylor_at_pnl.gov
- Systems Biology at PNNL: http://www.sysbio.org/
2 DOE's Genomics:GTL program
- Follow-up to the Human Genome Project. (DOE
launched the HGP in 1986. GenBank started at a
DOE lab.)
- Goal: to comprehensively understand cellular
processes in a realistic context, i.e., systems
biology. To be accomplished using high-throughput
advanced technologies and computation (petabyte-
scale databases, integrated knowledgebases,
network modeling).
- Under the direction of the DOE Office of Science
and its suboffices, the Office of Biological and
Environmental Research (OBER) and the Office of
Advanced Scientific Computing Research (OASCR).
- Focused on microbial organisms. (Genomes
sequenced in the DOE Microbial Genome Program.)
3 DOE Genomics:GTL Program Goals - The Role of
Living Systems in Energy Production,
Environmental Remediation, and Carbon Cycling and
Sequestration
- Molecular: Proteins and multicomponent molecular
machines that perform most of the cell's work
- Cellular: Gene regulatory networks and pathways
that control cellular processes
- Community: Microbial communities in which groups
of cells carry out complex processes in nature
4 Methods used to provide various data for
inferring regulatory networks (I)
- Prediction of transcription factor (TF) binding
sites: e.g., Dr. Lee Ann McCue's work at PNNL;
also MotifMogul software at ISB.
- Public and commercial databases (for example,
TRANSFAC, SCPD), gradually collecting wet lab
experiments that identify TFs and their binding
sites, one by one.
- Tiling arrays (expensive).
- Projects to find protein-protein interactions:
e.g., PNNL/ORNL GTL project (expensive).
- Computational algorithms that infer regulatory
edges based (primarily) on correlations in state:
the algorithms used in SEBINI. These require a
large amount of array or protein expression data.
More powerful algorithms/models are needed to infer
specific gene-to-gene connections than the
typical statistical techniques used for
clustering.
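To make "edges based on correlations in state" concrete, here is a minimal Python sketch of one such approach: score every gene pair by the mutual information of their discretized expression profiles and keep pairs above a cutoff. The function names, the toy data, and the threshold are illustrative assumptions, not SEBINI code or its actual algorithms.

```python
import math
from collections import Counter
from itertools import combinations


def mutual_information(x, y):
    """Mutual information (in bits) between two equal-length discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    mi = 0.0
    for (a, b), count in pxy.items():
        # p(a,b) * log2( p(a,b) / (p(a) * p(b)) ), written with raw counts
        mi += (count / n) * math.log2(count * n / (px[a] * py[b]))
    return mi


def infer_edges(binned, threshold):
    """Return undirected edges (gene1, gene2, mi) whose MI meets the threshold."""
    edges = []
    for g1, g2 in combinations(sorted(binned), 2):
        mi = mutual_information(binned[g1], binned[g2])
        if mi >= threshold:
            edges.append((g1, g2, mi))
    return edges


# Hypothetical binned expression states (0 = low, 1 = high) across 8 experiments.
binned = {
    "geneA": [0, 0, 1, 1, 0, 1, 0, 1],
    "geneB": [0, 0, 1, 1, 0, 1, 0, 1],  # same state pattern as geneA
    "geneC": [0, 1, 0, 1, 0, 1, 1, 0],  # statistically independent of geneA
}
edges = infer_edges(binned, threshold=0.5)
print(edges)
```

Note that pairwise mutual information gives undirected, unsigned edges; inferring direction or activator/inhibitor type needs additional modeling.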
5 Methods used to provide various data for
inferring regulatory networks (II)
- There are drawbacks to all methods. TFBS
prediction based on sequence and phylogenetic
comparison is very hard. Tiling arrays are
promising, but new (and expensive: they test one
putative source TF at a time). A drawback to both:
dependence on nearness of the TFBS to a gene to
infer the target. In eukaryotes, 70% of TFs bind
far from their targets (A. Aderem). Also, neither
yields the regulation type (activator / inhibitor),
just that there is binding. Also, it is common in
bacteria for a TF to lie between genes
transcribed in different directions. It may
regulate one or both; which one is unknown from
tiling and TFBS prediction.
- As for determining interaction networks using
mass spec: not yet high-throughput, in terms of
results.
6 Conclusion
- Computational algorithms that are based on
correlations in state will continue to be
used, remaining a standard approach for many
years to come.
- Bonus: gathering the large amount of array data
required for their use provides the raw data for
investigation of state functions, a topic for
future research.
7 Software Environment for BIological Network
Inference (SEBINI) - Introduction
SEBINI has been created to provide an interactive
environment for the evaluation and deployment of
algorithms used in the reconstruction of the
structure of biological regulatory networks.
SEBINI compares and trains network inference
methods on artificial networks and simulated gene
expression perturbation data. It also allows the
analysis within the same framework of
experimental high-throughput expression data
using the suite of (trained) inference methods.
Hence SEBINI should be useful both to software
developers wishing to evaluate, compare, refine,
or combine inference techniques, and to
bioinformaticians (or biologists) analyzing
experimental data. SEBINI provides a platform
that aids in more accurate reconstruction of
regulatory and interaction networks, with much
less effort, in less time.
10 Software Environment for BIological Network
Inference (SEBINI)
Architecture diagram components:
- PNNL's Bioinformatics Resource Manager (BRM)
- Input Module: high-throughput experimental data
- Builder Module: simulated high-throughput
expression data for artificial networks
- Text files (flat files)
- Visualization of inferred networks via Cytoscape
- PNNL's PRISM database system
- Human-readable reports on inferred networks
- SEBINI Central relational database (PostgreSQL)
- User interface: web site operated by Java
servlets
- Machine-readable network structure files for
dynamic modeling programs
- Topological statistics, network annotation,
post-inference processing: scoring and error
analysis (on artificial data sets)
- Collection of network inference algorithms. User
selects algorithm and data set, runs alg to infer
a network (a set of edges). Mutual
information-based and Bayesian network structure
learning algorithms provided for learning
regulatory networks. Also a PNNL/ORNL algorithm
for learning protein-protein interaction networks
from PNNL/ORNL bait-prey experiment mass spec
data sets. Inferred networks permanently stored
back into database.
11 SEBINI architecture and implementation (I)
- 100 Java programs (classes) and growing
rapidly.
- All inter-servlet communication is routed through
a CentralControl class. Algorithm handlers are
called directly from the Java servlets for the
corresponding web pages. This environment is
NOT a spiderweb: there is a control chokepoint.
- 30 PostgreSQL database tables. Slowly growing;
at present quite stable. One major database
change coming.
- Data security: project based. Upon login, the
user is assigned a 32-digit hexadecimal
JSessionID, which is checked before display of
every web page.
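The per-page session check described above can be illustrated with a small Python sketch. SEBINI itself uses Java servlets under Tomcat; the class and function names here (SessionRegistry, render_page) and the response strings are hypothetical, chosen only to show the pattern of issuing a 32-hex-digit ID at login and validating it before every page is rendered.

```python
import re
import secrets

# A session ID is assumed to be exactly 32 lowercase hexadecimal digits.
SESSION_ID_PATTERN = re.compile(r"[0-9a-f]{32}")


class SessionRegistry:
    """Maps session IDs to the project the user logged into (illustrative)."""

    def __init__(self):
        self._project_by_session = {}

    def login(self, project):
        session_id = secrets.token_hex(16)  # 16 random bytes -> 32 hex digits
        self._project_by_session[session_id] = project
        return session_id

    def project_for(self, session_id):
        """Return the project for a valid session ID, or None."""
        if not isinstance(session_id, str):
            return None
        if not SESSION_ID_PATTERN.fullmatch(session_id):
            return None
        return self._project_by_session.get(session_id)


def render_page(registry, session_id, page_name):
    """Check the session before producing any page content."""
    project = registry.project_for(session_id)
    if project is None:
        return "403 please log in"
    return f"{page_name} for project {project}"


registry = SessionRegistry()
sid = registry.login("demo_project")
print(render_page(registry, sid, "network_list"))
print(render_page(registry, "not-a-session", "network_list"))
```

Keeping the check in one function mirrors the "control chokepoint" idea: no page is produced without passing through it.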
12 SEBINI architecture and implementation (II)
- While SEBINI was originally designed to
infer directed (regulatory) networks, the code
now allows undirected networks, so algorithms
that infer interaction networks can be used and
such networks (e.g., protein-protein interaction
networks) permanently stored and analyzed.
- Design issues: interface for user navigation
among huge data sets; database design to map
inferred networks and inferred edges back to
original network and expression data (IDs must
be carried forward).
- expression data -> (one-to-many via binning alg
choice) -> binned exp data -> (one-to-many via
inference alg choice) -> inferred_network
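The chain above, with IDs carried forward at each step, might be modeled as follows. This is a Python sketch under assumed names: the record types, field names, and the in-memory Store stand in for SEBINI's PostgreSQL tables, whose actual schema is not given in these slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExpressionData:
    expression_id: int
    name: str


@dataclass(frozen=True)
class BinnedData:
    binned_id: int
    expression_id: int  # carried forward from the parent expression data
    binning_alg: str


@dataclass(frozen=True)
class InferredNetwork:
    network_id: int
    binned_id: int  # carried forward from the parent binned data
    inference_alg: str
    edges: tuple


class Store:
    """In-memory stand-in for the relational tables (hypothetical)."""

    def __init__(self):
        self.expression, self.binned, self.networks = {}, {}, {}

    def lineage(self, network_id):
        """Trace an inferred network back to its binned and raw data."""
        net = self.networks[network_id]
        binned = self.binned[net.binned_id]
        expr = self.expression[binned.expression_id]
        return expr, binned, net


store = Store()
store.expression[1] = ExpressionData(1, "heat_shock_arrays")
# One expression set -> many binned sets (one per binning algorithm choice).
store.binned[10] = BinnedData(10, 1, "equal_width_3")
store.binned[11] = BinnedData(11, 1, "quantile_3")
# One binned set -> many inferred networks (one per inference algorithm choice).
store.networks[100] = InferredNetwork(100, 10, "mutual_information", (("g1", "g2"),))
store.networks[101] = InferredNetwork(101, 10, "bayesian_network", (("g1", "g3"),))

expr, binned, net = store.lineage(101)
print(expr.name, binned.binning_alg, net.inference_alg)
```

Because each child record stores its parent's ID, any inferred edge can be traced back to the exact binning and raw data that produced it.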
13 SEBINI architecture and implementation (III)
- Design issues (continued): multi-threaded job
monitoring; web pages to view all data points,
working towards transparency of all data in all
tables (security access permitting) in the
database via display through the web site.
- Permanent storage of binned/pre-processed data
sets. Novel. Important for efficiency,
transparency, speed of response, analysis of
results.
- Job times recorded to the millisecond. Algorithms
can be compared on efficiency vs. relative power.
- A Java handler class is created for each new
algorithm, to wrap it, i.e., to handle
communication with the database and web site.
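The handler idea, a thin wrapper that runs an algorithm, times the job, and stores the result, can be sketched as below. SEBINI's handlers are Java classes; this Python version, including the AlgorithmHandler name, the record fields, and the list used in place of the database, is an illustrative assumption only.

```python
import time


class AlgorithmHandler:
    """Wraps one inference algorithm; records results and job time (illustrative)."""

    def __init__(self, name, algorithm, results_store):
        self.name = name
        self.algorithm = algorithm          # callable: (data, **params) -> list of edges
        self.results_store = results_store  # stand-in for the database

    def run(self, dataset_id, data, **params):
        start = time.perf_counter()
        edges = self.algorithm(data, **params)
        elapsed_ms = round((time.perf_counter() - start) * 1000)
        record = {
            "algorithm": self.name,
            "dataset_id": dataset_id,
            "params": params,
            "edges": edges,
            "elapsed_ms": elapsed_ms,  # job time, to the millisecond
        }
        self.results_store.append(record)
        return record


def toy_algorithm(data, threshold=0.5):
    """Hypothetical algorithm: keep pairs whose precomputed score meets the threshold."""
    return [(a, b) for (a, b, score) in data if score >= threshold]


results = []
handler = AlgorithmHandler("toy", toy_algorithm, results)
record = handler.run(
    dataset_id=1,
    data=[("g1", "g2", 0.9), ("g2", "g3", 0.1)],
    threshold=0.5,
)
print(record["edges"], record["elapsed_ms"])
```

Since every handler writes the same record shape, algorithms can later be compared on run time and accuracy from the stored results alone.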
14 SEBINI architecture and implementation (IV)
- SEBINI was initially implemented on a Dell
desktop running Red Hat Linux, using Java ver.
1.4, PostgreSQL ver. 7.4, and Tomcat 4.1.
Cytoscape is used for network visualization,
invoked through Java Web Start.
- SEBINI has also been installed on a Windows web
server that will soon be accessible from outside
PNNL, using Java ver. 1.5, Tomcat ver. 5, and
PostgreSQL 8.1. Jakarta Commons Java libraries
are used for data file uploads.
- Machine-specific parameters are stored in an
easily changed properties text file.
- development site:
http://asimov.emsl.pnl.gov:8080/NIT/NIT.html
- public demo site:
https://www.emsl.pnl.gov/NIT/NIT.html
15 Possible sources of inference algorithms
- Probabilistic graphical models (structure
learning: Bayesian networks, among others)
- Information theory (mutual information based, and
conditional mutual information, CMI)
- Classical statistics, analysis of correlation,
e.g., Pearson correlation
- Machine learning: decision trees (C4.5, ID3),
supervised and unsupervised
- Data mining: association rule mining (really
want to try this)
- Pattern classification
- Deductive reasoning
- Neural networks
- Fuzzy logic?
22 SEBINI flow of control, for experimental data
- Log into a project, or create a new project
- Create a network set (container) for the one
experimental network
- Upload the experimental expression data file
- Select a binning algorithm and bin the data
- Select an inference algorithm, select the alg
parameters, and infer a network
- Visualize the inferred network, now in the
database, using Cytoscape
- View topological statistics
- View node and edge annotations
- Generate a human-readable report
- Export the topology in a format suitable for
input into dynamic simulations
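The binning step in this workflow converts continuous expression values into a small number of discrete states before inference. A minimal Python sketch of one possible choice, equal-width binning, follows; the slides do not say which binning algorithms SEBINI offers, so the method, function name, and example values are assumptions.

```python
def equal_width_bins(values, n_bins=3):
    """Map each value to a bin index 0..n_bins-1 using equal-width intervals."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: everything falls in a single state.
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it into the top bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical expression levels for one gene across seven experiments.
expression = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
states = equal_width_bins(expression, n_bins=3)
print(states)
```

Different binning choices applied to the same expression data yield different binned data sets, which is why the binning algorithm is selected as its own step.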
23 SEBINI flow of control, for synthetic data
- Log into a project, or create a new project
- Select a topology build algorithm, enter param
values, and create a synthetic network set
- Select an expression set build algorithm, enter
param values, and create synthetic expression
sets
- Select a binning algorithm and bin the data
- Select an inference algorithm, enter param
values, and infer a network
- Visualize and compare the real and inferred
network(s), using Cytoscape
- View topological statistics
- View precision, recall, F-measure statistics for
a precise measure of how well the inference alg
performed against the gold standard, the known
synthetic network.
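The scoring step can be sketched in Python as a comparison of the inferred edge set against the known synthetic edges. The definitions used are the standard ones (precision = correct inferred edges / all inferred edges, recall = correct inferred edges / all true edges, F-measure = their harmonic mean); treating edges as directed (source, target) pairs, and the example networks themselves, are assumptions for illustration.

```python
def score_edges(true_edges, inferred_edges):
    """Return (precision, recall, f_measure) for directed edge sets."""
    true_set, inferred_set = set(true_edges), set(inferred_edges)
    true_positives = len(true_set & inferred_set)
    precision = true_positives / len(inferred_set) if inferred_set else 0.0
    recall = true_positives / len(true_set) if true_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical gold-standard synthetic network and an inferred network.
true_edges = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "D")]
inferred_edges = [("A", "B"), ("B", "C"), ("D", "C"), ("A", "E")]

precision, recall, f_measure = score_edges(true_edges, inferred_edges)
print(precision, recall, f_measure)
```

Here ("D", "C") counts as wrong because its direction is reversed; for undirected interaction networks each edge would be normalized (for example, by sorting its endpoints) before comparison.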
24 Some goals for SEBINI
- Make network inference a starting point, not an
end point (currently an end point that is usually
not even reached). Simple deployment of
state-of-the-art algs not previously available to
a biology lab, available over the web or via
local SEBINI install.
- Advance the field by improving the algorithms.
Possibility of combining alg output (as done in
GRAIL), now that alg results are stored in the
same database. Develop expertise on how much data
is needed, appropriate cutoffs, species-specific
post-processing, the weaknesses of a given
method, and what background information on a
genome is most useful to supplement the primary
expression data.
- Network biology is only in its infancy (2004,
Barabasi). Nobody knows what inference
algorithm(s) will perform best; theoretical
guidance is lacking. But SEBINI will position us
to empirically test new algorithms and easily
modify or combine algs, possibly with
species-specific information.
25 DOE Science Undergraduate Laboratory Internship
(SULI) program
- DOE Science Undergraduate Lab Internship (SULI)
program. Year-round. Duration: 10 wks summer,
12-16 wks in fall or spring. $400/week, housing.
http://science-ed.pnl.gov/undergrad/erulf.stm
- Manager: Karen Wieda, kj.wieda_at_pnl.gov, (509)
375-3811.
- Of 200 applications specifying PNNL as 1st round
choice, 50 offers made last year. Summer term is
the most competitive. DOE pays most of cost.
- PNNL Science and Engineering Education -
Fellowship Services. Undergrad, grad, visiting
scientists, sabbaticals. $400-750/week, 5 months
max. http://science-ed.pnl.gov/studentops.stm
- Manager: Rebecca Janosky,
rebecca.janosky_at_pnl.gov, (509) 375-2302.
- About 900-1000 applications/yr, of which 250 get
an offer. Completed applications posted to
central site. Mentor/host pays the cost, plus 18%
overhead.
26 Take-home SULI information
- Apply by early January 2007 for summer 2007. PNNL
sees students who selected it as first choice on
Feb 1.
- SULI manager: Ms. Karen Wieda,
karen.wieda_at_pnl.gov, phone (509) 375-3811
- Ronald Taylor (for the SEBINI project):
ronald.taylor_at_pnl.gov
- PNNL science education web site:
http://science-ed.pnl.gov/students/
- Other background web sites: www.pnl.gov,
www.sysbio.org, genomicsgtl.energy.gov/compbio/
27 Acknowledgements
- For work on SEBINI: Anuj Shah (for work on
Cytoscape, MI alg translation, statistics),
Meridith Blevins (SULI), Charles Treatman (SULI)
- Funding from the US Dept of Energy, through the
PNNL Biomolecular Systems Initiative, the EMSL
Membrane Grand Challenge at PNNL, and the joint
PNNL/ORNL protein-protein interaction network
mapping GTL project