Guan N. Lin (Nick) - PowerPoint PPT Presentation

1
Bioinformatics
Prediction of Plant Protein-Protein Interaction
Using Sequence Only
Guan N. Lin (Nick) Bioinformatics Intern
2
Outline
  • Project Background
  • Goals and Obstacles
  • Tool Development & Method Design
  • Results & Analysis
  • PPI prediction for leading genes
  • Acknowledgements

4
Protein-protein interaction (PPI)
  • Each living cell is packed with proteins that
    continuously interact with each other to control
    the cell's growth, function and eventual fate.
  • These interactions alter protein kinetic
    properties, substrate binding, catalysis, etc.
  • Researchers have developed a variety of chemical
    and biochemical techniques to understand the who,
    what, where, when and why of those interactions.

5
Systems biology: From cell to network
6
PPI (Protein-protein interaction) prediction
  • A study combining bioinformatics and structural
    biology to identify and catalog interactions
    between pairs or groups of proteins.
  • Determination by experiments: yeast two-hybrid,
    affinity purification, co-immunoprecipitation,
    etc.
  • Prediction by computation: model building through
    pattern discovery using sequences, protein
    structural information, evolutionary information,
    etc.
  • PPI network construction provides important
    insight for investigating intracellular signaling
    pathways.

8
Project goals and obstacles
  • Goals
  • Use parts of free tools and open-source code to
    build a PPI prediction pipeline based on protein
    sequence information only.
  • Use cross-species PPI data (Human, Drosophila,
    Yeast and C. elegans) to do genome-scale plant
    PPI prediction.
  • Obstacles
  • Open-source code lacks the organization and
    documentation needed for system integration.
  • Computational complexity limits analysis speed
    within the limited time available.
  • It is difficult to generalize a consistent
    pattern from cross-species data.

9
Outline
  • Project Background
  • Goals and Obstacles
  • Tool Development & Method Design
  • Basic scheme
  • Tool development
  • Model design and tuning
  • Results & Analysis
  • PPI prediction for leading genes
  • Acknowledgements

10
Basic scheme (how do we do it?)
Rationale: PPIs are basic structural elements of the
molecular circuitry in biological systems and will
provide valuable insights for optimization/MOA.
Scheme:
  • Training data (sequences of interacting proteins)
  • Predict new interactions from sequences
  • SVM kernel classifier
  • Sequence patterns
  • Validation
Training set for SVM kernel classifier:
  • Positive training set (experimental interactions,
    some for training, some for validation)
  • Negative training set (mostly randomly generated
    pairs)
11
Using conjoint triads for sequence pattern
construction
  • Reduced-alphabet sequence pattern training
  • Classify the 20 amino acid (AA) types into 7
    classes based on their properties (hydrogen
    bonding, hydrophobicity, side-chain volume, etc.).
  • Build AA triplets from the 7 classes, called
    conjoint triads (7 x 7 x 7 = 343 unique types),
    and save them in a vector V.
  • Calculate the frequency of each triad for each
    protein sequence.

Shen, PNAS 2007
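
The triad counting described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual Java/C++ code. The seven-class grouping below is the one recalled from Shen et al. (PNAS 2007) and should be treated as an assumption, as should the names `triad_frequencies` and `pair_features` and the choice to represent a protein pair by concatenating the two 343-long vectors. Raw counts are returned; any normalization step is left out.

```python
from itertools import product

# Seven amino-acid classes (assumed grouping, recalled from Shen et al. 2007).
CLASSES = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
AA_TO_CLASS = {aa: idx for idx, group in enumerate(CLASSES) for aa in group}

# V: all 7 * 7 * 7 = 343 class triplets in a fixed order.
TRIADS = list(product(range(7), repeat=3))
TRIAD_INDEX = {triad: i for i, triad in enumerate(TRIADS)}


def triad_frequencies(sequence):
    """Return a 343-long list of raw triad counts for one protein sequence."""
    classes = [AA_TO_CLASS.get(aa) for aa in sequence.upper()]
    counts = [0] * len(TRIADS)
    # Slide a window of three consecutive residues along the sequence.
    for window in zip(classes, classes[1:], classes[2:]):
        if None in window:  # skip windows with non-standard residues
            continue
        counts[TRIAD_INDEX[window]] += 1
    return counts


def pair_features(seq_a, seq_b):
    """Describe a protein pair by concatenating both triad vectors (686 values)."""
    return triad_frequencies(seq_a) + triad_frequencies(seq_b)
```

A sequence of length n contributes at most n - 2 triads, so very short sequences give nearly empty vectors.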
13
System/Tool design flowchart
Implementation: Java code and C/C++ code
Flowchart steps:
  • Raw PPI file
  • Prepare training evidence
  • Generate negative PPI pairs
  • Input sequence / test sequences
  • Build sequence pattern (conjoint triads, triad
    frequency)
  • SVM training: SVM training input, optimize
    parameters (C, γ), build SVM training model
  • SVM prediction: SVM test input, SVM training
    model, prediction
Negative PPI pairs are generated from the proteins
in the positive PPI pairs. If AB and IJ are positive
PPIs, then AI, AJ, BI and BJ could be considered
negative pairs. # of negative pairs = # of positive
pairs.
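
The negative-pair rule on this slide (if AB and IJ are positive, then AI, AJ, BI and BJ are candidate negatives) can be sketched as follows. This is an illustrative sketch only: the function name `generate_negative_pairs`, the fixed random seed, and the default of drawing as many negatives as there are positives are assumptions, not details confirmed by the slides.

```python
import random


def generate_negative_pairs(positive_pairs, n_negative=None, seed=0):
    """Sample protein pairs that are not listed among the positive PPIs."""
    # Treat pairs as unordered so that (A, B) and (B, A) are the same pair.
    positives = {frozenset(pair) for pair in positive_pairs}
    proteins = sorted({prot for pair in positive_pairs for prot in pair})
    if n_negative is None:
        n_negative = len(positives)  # assumed: balanced with the positives
    # Every cross pairing of known proteins that is not a known interaction.
    candidates = [
        (a, b)
        for i, a in enumerate(proteins)
        for b in proteins[i + 1:]
        if frozenset((a, b)) not in positives
    ]
    rng = random.Random(seed)
    return rng.sample(candidates, min(n_negative, len(candidates)))
```

Enumerating all candidate pairs is quadratic in the number of proteins, so for genome-scale data a rejection-sampling variant would probably be preferable.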
14
Screenshot of the PPI prediction tool
16
Publicly available experimental data
  • Arabidopsis: 4,400 PPI pairs (TAIR, BioGRID,
    IntAct), 3,000 genes
  • C. elegans: 5,400 PPI pairs (BioGRID, IntAct)
  • Human: 23,000 PPI pairs (HPRD, IntAct), 6,900
    genes
  • Drosophila: 24,000 PPI pairs (IntAct), 7,000 genes
  • Yeast: 48,000 PPI pairs (BioGRID, IntAct), 7,000
    genes

17
SVM for triad pattern model training and tuning
SVM training parameters
SVM parameter optimization is performed using a
grid-search procedure. Parameters:
  • C: cost, to minimize training error (value range
    0.125 to 512)
  • γ: kernel gamma, to maximize training capability
    (value range 0.125 to 8)
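
A grid search over the two parameter ranges above could look like the sketch below. The slide gives only the ranges; stepping through them in powers of two is an assumption (it is a common convention for SVM tuning), and `score` is a hypothetical callable that would train an SVM with the given C and gamma and return, for example, a cross-validated accuracy.

```python
import math
from itertools import product


def log2_grid(low, high):
    """Powers of two from low to high inclusive, e.g. 0.125, 0.25, ..., 512."""
    lo_exp, hi_exp = round(math.log2(low)), round(math.log2(high))
    return [2.0 ** e for e in range(lo_exp, hi_exp + 1)]


C_VALUES = log2_grid(0.125, 512)    # 13 candidate cost values
GAMMA_VALUES = log2_grid(0.125, 8)  # 7 candidate kernel gamma values


def grid_search(score, c_values=C_VALUES, gamma_values=GAMMA_VALUES):
    """Return (best_score, C, gamma) for the pair that maximizes score(C, gamma)."""
    return max((score(c, g), c, g) for c, g in product(c_values, gamma_values))
```

With these ranges the search evaluates 13 x 7 = 91 parameter pairs, which is where most of the training time goes.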
18
Outline
  • Project Background
  • Goals and Obstacles
  • Tool Development & Method Design
  • Results & Analysis
  • Preliminary results and problems
  • Further method modification
  • Further results
  • PPI prediction for leading genes
  • Acknowledgements

19
Accuracy measurements
                     Real outcome
Predicted outcome    TRUE                  FALSE
TRUE                 TP (True Positive)    FP (False Positive)
FALSE                FN (False Negative)   TN (True Negative)
Sensitivity = TP / (TP + FN)
Specificity = TN / (FP + TN)
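
As a small worked illustration of these two measures, the following sketch counts the four confusion-matrix cells from parallel lists of real and predicted labels and then applies the formulas. The function names are illustrative.

```python
def confusion_counts(actual, predicted):
    """Count TP, FP, FN, TN from parallel lists of booleans."""
    tp = fp = fn = tn = 0
    for real, pred in zip(actual, predicted):
        if pred and real:
            tp += 1  # predicted interaction, real interaction
        elif pred and not real:
            fp += 1  # predicted interaction, no real interaction
        elif real:
            fn += 1  # missed a real interaction
        else:
            tn += 1  # correctly rejected a non-interaction
    return tp, fp, fn, tn


def sensitivity(tp, fn):
    """Fraction of real positives that were predicted positive."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of real negatives that were predicted negative."""
    return tn / (fp + tn)
```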
20
Preliminary results and observations
  • Prediction for Arabidopsis (2,600 positive PPIs +
    2,600 negative PPIs) using different data sets
    without any filtering or processing.

Species
Arabidopsis X X X X
Human X X X
Yeast X X X
Drosophila X
C. elegans X
Accuracy: TP 45, TN 86 | TP 2, TN 96 | TP 98, TN 3 | TP 10, TN 92 | TP 63, TN 38 | TP 25, TN 55
Observations:
  1. Overall low accuracies.
  2. Data from different species exhibit very
     different prediction patterns; some, like Human
     and Yeast, show opposite extremes.
=> Conclusion: no meaningful or useful predictions
so far.
22
  • Careful selection of subsets of cross-species
    data for training is essential to get valid
    results
  • Use GO (Gene Ontology) slim categories for data
    filtering
  • Red bars: Arabidopsis whole-genome proteins
  • Blue bars: Arabidopsis PPI proteins
  • The two distributions show a correlation of 0.92
  • Proteins in the PPI data do represent the overall
    trend of the whole genome
  • Filter species data by TAIR GO slim.

23
How to categorize proteins into GO slim terms -
using GO level indexing
Step 1: make GO index (from ontology files)
Step 2: link GO index to genes (from gene-to-GO
association files)
Step 3: get GO slim term from GO index
Example: YBR085W -> GO:0055085 transmembrane
transport, index 3-10-5-44
GO slim terms and their indexes:
  • Developmental process (GO:0007252): 3-9-26
  • Transport (GO:0006810): 3-10-5
  • Signal transduction (GO:0007165): 3-7-7-15
YBR085W belongs to the Transport slim category
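
One way to read the level index is as a path from the ontology root, so that a gene's term falls under a slim term when the slim term's index is a prefix of the gene term's index. The sketch below implements that reading with the three example indexes from the slide. The prefix interpretation, the name `slim_category`, and the restriction to one index per term are assumptions; a GO term with several parents would need several indexes.

```python
# Slim-term indexes taken from the slide's example.
SLIM_INDEX = {
    "3-9-26": "Developmental process",
    "3-10-5": "Transport",
    "3-7-7-15": "Signal transduction",
}


def slim_category(go_index, slim_index=SLIM_INDEX):
    """Return the slim term whose index is the longest prefix of go_index."""
    parts = go_index.split("-")
    # Compare whole path components, so "3-10-50" does not match "3-10-5".
    for end in range(len(parts), 0, -1):
        prefix = "-".join(parts[:end])
        if prefix in slim_index:
            return slim_index[prefix]
    return None  # the term is not under any listed slim category
```

For the slide's example, `slim_category("3-10-5-44")` walks back from the full index to `3-10-5` and reports Transport.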
24
Next step: using the slim-category frequency
distribution to select subsets of cross-species
data
Use the percentages shown in the Arabidopsis data to
select similar subsets for other species
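
The selection step can be thought of as stratified sampling: draw from each slim category in proportion to its share in the Arabidopsis data. The sketch below is one possible reading, not the procedure actually used. It assumes each item (for example a protein) carries a single slim category, and the names `select_matched_subset`, `target_fractions` and `subset_size` are hypothetical.

```python
import random
from collections import defaultdict


def select_matched_subset(items, target_fractions, subset_size, seed=0):
    """Sample items so that category shares follow target_fractions.

    items: list of (item, category) tuples from another species.
    target_fractions: category -> fraction observed in the reference data.
    """
    by_category = defaultdict(list)
    for item, category in items:
        by_category[category].append(item)
    rng = random.Random(seed)
    subset = []
    for category, fraction in target_fractions.items():
        pool = by_category.get(category, [])
        # Take the category's quota, or everything if the pool is too small.
        quota = min(round(fraction * subset_size), len(pool))
        subset.extend(rng.sample(pool, quota))
    return subset
```

When a category is under-represented in the other species, the subset comes out smaller than `subset_size`; a real pipeline would have to decide whether to shrink the other categories to keep the proportions exact.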
26
Model results comparison with modified datasets
Data selection: Model 1, Model 2, Model 3, Model 4, Model 5, Model 6, YuChen
Arabidopsis X ?
Human X X
Yeast X X ?
C. elegans X X
Drosophila X X
Accuracy
Model     True positive        True negative        Sensitivity   Specificity
Model 1   944/1917 (49.24%)    1647/1915 (86%)      78            64
Model 2   433/1917 (22%)       1724/1915 (90%)      69            55
Model 3   492/1917 (25.7%)     1673/1915 (87.4%)    67            54
Model 4   501/1917 (26.1%)     1473/1915 (76.9%)    53            51
Model 5   440/1917 (23%)       1665/1915 (86.9%)    64            53
Model 6   756/1917 (39.4%)     1418/1915 (74%)      72            58
YuChen    51/1917 (2.7%)       NA                   NA            NA
Test data: 1917 positive-evidence Arabidopsis PPI
pairs + 1915 negative Arabidopsis PPI pairs. The
probability of predicting a random pair to be a true
PPI is 2.6%. Observation: the modified datasets
are able to remove almost all negative pairs.
27
ROC curves show that the predictive power of the
models is much better than random prediction.
Prepared by Xiao Yang
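
For readers unfamiliar with ROC curves, the sketch below shows how such a curve and its area (AUC) can be computed from predicted probabilities and true labels. It is a generic illustration, not the code used to prepare the slide's figure. A random predictor gives an AUC of about 0.5, and a perfect one gives 1.0.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points, sweeping the threshold from high to low.

    labels: 1 for a true interaction, 0 otherwise.
    scores: predicted probability of interaction, same order as labels.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last item in a run of tied scores.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )
```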
28
Model prediction pattern correlations
Prediction probability correlation
             arab   human      yeast      c. elegans   cdhy       drosophila
arab         1      0.123131   0.158611   0.177077     0.250751   0.047598
human               1          0.603687   0.210406     0.09851    0.553002
yeast                          1          0.309629     0.47832    0.290454
c. elegans                                1            0.242561   0.101113
cdhy                                                   1          -0.22017
drosophila                                                        1
Note:
  1. "cdhy" means combining c. elegans, drosophila,
     human and yeast data together.
  2. The Drosophila dataset has the poorest
     prediction trend correlation with the
     Arabidopsis dataset.
  3. The combined dataset exhibits a stronger
     correlation with Arabidopsis than any
     individual dataset.
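
A correlation table like this one can be produced by scoring the same test pairs with each model and correlating the resulting probability vectors. The sketch below assumes the Pearson coefficient, which the slide does not state, and the `predictions` mapping (model name to list of predicted probabilities) is a hypothetical input format.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(predictions):
    """Upper-triangular correlations between every pair of models.

    predictions: model name -> predicted probabilities for the same test pairs.
    """
    names = list(predictions)
    return {
        (a, b): pearson(predictions[a], predictions[b])
        for i, a in enumerate(names)
        for b in names[i:]
    }
```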
30
Summary
  • Built an easy-to-use and successful system for
    PPI prediction based on sequence information
    only.
  • Constructed PPI prediction models and proved the
    concept of using cross-species information for
    plant PPI prediction when experimental
    information is lacking.
  • Apply PPI prediction to the MOA study of leading
    genes.

32
Acknowledgements
  • Zheng Li
  • J.D. Liu
  • Everyone in the bioinformatics team
  • Paggy Sullivan and University relations