A statistical framework for genomic data fusion - PowerPoint PPT Presentation

1
A statistical framework for genomic data fusion
  • William Stafford Noble
  • Department of Genome Sciences
  • Department of Computer Science and Engineering
  • University of Washington

2
Outline
  • Recognizing correctly identified peptides
  • The support vector machine algorithm
  • Experimental results
  • Yeast protein classification
  • SVM learning from heterogeneous data
  • Results

3
Recognizing correctly identified peptides
4
Database search
Protein sample
Sequence database
Tandem mass spectrometer
Search algorithm
Observed spectra
Predicted peptides
5
The learning task
Theoretical
Observed
  • We are given paired observed and theoretical
    spectra.
  • Question: Is the pairing correct?

6
Properties of the observed spectrum
  1. Total peptide mass. Too small yields little
    information; too large (>25 amino acids) yields
    uneven fragmentation.
  2. Charge (+1, +2 or +3). Provides some evidence
    about amino acid composition.
  3. Total ion current. Proportional to the amount of
    peptide present.
  4. Peak count. Small indicates poor fragmentation;
    large indicates noise.

7
Observed vs. theoretical spectra
  1. Mass difference.
  2. Percent of ions matched. Number of matched ions /
    total number of ions.
  3. Percent of peaks matched. Number of matched peaks
    / total number of peaks.
  4. Percent of peptide fragment ion current matched.
    Total intensity of matched peaks / total
    intensity of all peaks.
  5. Preliminary SEQUEST score (Sp).
  6. Preliminary score rank.
  7. SEQUEST cross-correlation (XCorr).
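
The three "percent matched" features above can be sketched as follows. This is a minimal illustration, not the original implementation: the function name, the peak representation, and the matching tolerance `tol` are all assumptions, and "ions" is read here as the predicted fragment ions of the theoretical spectrum.

```python
def match_features(observed, theoretical_mz, tol=0.5):
    """Compute three match features for one observed/theoretical spectrum pair.

    observed: list of (m/z, intensity) peaks from the observed spectrum.
    theoretical_mz: list of predicted fragment-ion m/z values.
    tol: hypothetical matching tolerance in m/z units (an assumption).
    """
    def near(mz, targets):
        return any(abs(mz - t) <= tol for t in targets)

    observed_mz = [mz for mz, _ in observed]
    matched_peaks = [(mz, inten) for mz, inten in observed
                     if near(mz, theoretical_mz)]
    matched_ions = [t for t in theoretical_mz if near(t, observed_mz)]
    total_current = sum(inten for _, inten in observed)
    return {
        # matched ions / total number of (predicted) ions
        "ions_matched": len(matched_ions) / len(theoretical_mz),
        # matched peaks / total number of (observed) peaks
        "peaks_matched": len(matched_peaks) / len(observed),
        # intensity of matched peaks / intensity of all peaks
        "current_matched": sum(i for _, i in matched_peaks) / total_current,
    }


# Made-up toy spectra, for illustration only.
observed = [(100.0, 10.0), (200.2, 30.0), (300.0, 5.0), (455.9, 55.0)]
theoretical = [100.1, 200.0, 350.0]
features = match_features(observed, theoretical)
```

In this toy pair, two of the three predicted ions and two of the four observed peaks match, and the matched peaks carry 40 of the 100 units of ion current.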

8
Top-ranked vs. second-ranked peptides
  1. Change in cross-correlation. Compute the
    difference in XCorr for the top-ranked and
    second-ranked peptide.
  2. Percent sequence identity. Usually
    anti-correlated with change in cross-correlation.

9
Negative examples
Positive examples
10
The support vector machine algorithm
11
SVMs in computational biology
  • Splice site recognition
  • Protein sequence similarity detection
  • Protein functional classification
  • Regulatory module search
  • Protein-protein interaction prediction
  • Gene functional classification from microarray
    data
  • Cancer classification from microarray data

12
Support vector machine
13
Support vector machine
Locate a plane that separates positive from
negative examples.
Focus on the examples closest to the boundary.
14
Kernel matrix
17
Kernel functions
  • Let X be a finite input space.
  • A kernel is a function K such that, for all
    $x, z \in X$, $K(x, z) = \langle \phi(x), \phi(z) \rangle$,
    where $\phi$ is a mapping from X to an (inner
    product) feature space F.
  • Let K(x, z) be a symmetric function on X. Then
    K(x, z) is a kernel function if and only if the
    matrix $\mathbf{K} = \big(K(x_i, x_j)\big)_{i,j=1}^{n}$
    is positive semi-definite.
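
As a small illustration of the two statements above, the sketch below builds a kernel from an explicit feature map and spot-checks that its Gram matrix behaves like a positive semi-definite matrix. The feature map `phi` is a hypothetical example chosen for this sketch, and testing the quadratic form on random vectors is only a sanity check, not a proof.

```python
import random


def phi(x):
    """Hypothetical explicit feature map from a 2-vector to a 3-d feature space."""
    a, b = x
    return (a * a, b * b, 2 ** 0.5 * a * b)


def kernel(x, z):
    """K(x, z) = <phi(x), phi(z)>, which equals (x . z)^2 for this phi."""
    return sum(p * q for p, q in zip(phi(x), phi(z)))


def gram_matrix(points):
    """Matrix of kernel values K(x_i, x_j) over a finite set of points."""
    return [[kernel(x, z) for z in points] for x in points]


def quadratic_form(K, v):
    """Compute v^T K v, which is non-negative for every v when K is PSD."""
    n = len(v)
    return sum(v[i] * K[i][j] * v[j] for i in range(n) for j in range(n))


random.seed(0)
X = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(5)]
K = gram_matrix(X)

# Smallest value of v^T K v over 200 random test vectors.
min_form = min(
    quadratic_form(K, [random.uniform(-1, 1) for _ in X]) for _ in range(200)
)
```

Because this kernel is an inner product in feature space, `min_form` should never be meaningfully negative; any small negative value would be floating-point rounding.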

18
Peptide ID kernel function
  • Let p(x,y) be the function that computes a
    13-element vector of parameters for a pair of
    spectra, x and y.
  • The kernel function K operates on pairs of
    observed and theoretical spectra

19
Experimental results
20
Experimental design
  • Data consists of one 13-element vector per
    predicted peptide.
  • Each feature is normalized to sum to 1.0 across
    all examples.
  • The SVM is tested using leave-one-out
    cross-validation.
  • The SVM uses a second-degree polynomial,
    normalized kernel with a 2-norm asymmetric soft
    margin.
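
The feature scaling and the normalized second-degree polynomial kernel described above might look like the sketch below. It is an illustration under stated assumptions: the polynomial is taken to be (x · z + 1)^2 (the "+ 1" offset is an assumption), the function names are made up, and the soft-margin SVM training itself is not shown.

```python
import math


def normalize_features(rows):
    """Scale each feature (column) so that it sums to 1.0 across all examples."""
    sums = [sum(col) for col in zip(*rows)]
    return [[v / s if s else 0.0 for v, s in zip(row, sums)] for row in rows]


def poly2(x, z):
    """Second-degree polynomial kernel; the +1 offset is an assumption."""
    return (sum(a * b for a, b in zip(x, z)) + 1.0) ** 2


def normalized_poly2(x, z):
    """Normalized kernel: K(x, z) / sqrt(K(x, x) * K(z, z))."""
    return poly2(x, z) / math.sqrt(poly2(x, x) * poly2(z, z))


# Made-up 3-feature examples (the slides use 13 features per peptide).
rows = [[1.0, 2.0, 0.5], [3.0, 2.0, 1.5], [1.0, 4.0, 2.0]]
scaled = normalize_features(rows)
k01 = normalized_poly2(scaled[0], scaled[1])
k00 = normalized_poly2(scaled[0], scaled[0])
```

Normalization makes every example have unit self-similarity, so kernel values are comparable across peptides regardless of the raw feature magnitudes.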

21
Three data sets
  • Set 1: Ion trap mass spectrometer. SEQUEST search
    on the full non-redundant database.
  • Set 2: Ion trap mass spectrometer. SEQUEST
    search on human NRDB.
  • Set 3: Quadrupole time-of-flight mass
    spectrometer. Sequence search on human NRDB.

22
Data set sizes
Data set         Total  Negative  Positive
Ion-trap NRDB      976       479       497
Ion-trap HNRDB    1161       465       696
QTOF HNRDB        1540       523      1017
25
0.94
0.95
0.99
26
(18,966)
(126,931)
(27,936)
(108,936)
(57,732)
27
(No Transcript)
28
(No Transcript)
29
Conversion to probabilities
  • Hold out a subset of the training examples.
  • Use the hold-out set to fit a sigmoid.
  • This is equivalent to assuming that the SVM
    output is proportional to the log-odds of a
    positive example.

y = label, f = discriminant
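
The hold-out sigmoid fit can be sketched as below. This is a minimal illustration, not the original procedure: it assumes the common parameterization P(y = 1 | f) = 1 / (1 + exp(A f + B)), fits A and B by plain gradient descent on the negative log-likelihood, and uses made-up discriminant values. The learning rate and step count are arbitrary choices.

```python
import math


def sigmoid_prob(f, A, B):
    """P(y = 1 | f) = 1 / (1 + exp(A*f + B)), computed in a stable way."""
    t = A * f + B
    if t >= 0:
        e = math.exp(-t)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(t))


def fit_sigmoid(f_values, labels, lr=0.1, steps=5000):
    """Fit A and B on hold-out data; labels are 0 (negative) or 1 (positive)."""
    A, B = 0.0, 0.0
    n = len(f_values)
    for _ in range(steps):
        grad_A = grad_B = 0.0
        for f, y in zip(f_values, labels):
            # Derivative of the negative log-likelihood w.r.t. (A*f + B).
            residual = y - sigmoid_prob(f, A, B)
            grad_A += residual * f
            grad_B += residual
        A -= lr * grad_A / n
        B -= lr * grad_B / n
    return A, B


# Hypothetical hold-out discriminants and labels (classes overlap slightly).
holdout_f = [-2.0, -1.5, -1.0, -0.5, 0.3, -0.2, 0.5, 1.0, 1.5, 2.0]
holdout_y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
A, B = fit_sigmoid(holdout_f, holdout_y)
p_high = sigmoid_prob(2.0, A, B)
p_low = sigmoid_prob(-2.0, A, B)
```

With this parameterization a negative A means that larger discriminant values map to higher probabilities of a correct identification.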
31
Yeast protein classification
32
Membrane proteins
  • Membrane proteins anchor in a cellular membrane
    (plasma, ER, Golgi, mitochondrial).
  • Communicate across membrane.
  • Pass through the membrane several times.

33
Heterogeneous data
sequence data
mRNA expression data
protein-protein interaction data
34
Vector representation
  • Each matrix entry is an mRNA expression
    measurement.
  • Each column is an experiment.
  • Each row corresponds to a gene.

36
Sequence kernels
>ICYA_MANSE
GDIFYPGYCPDVKPVNDFDLSAFAGAWHEIAKLPLENENQGKCTIAEYKY
DGKKASVYNSFVSNGVKEYMEGDLEIAPDAKYTKQGKYVMTFKFGQRVVN
LVPWVLATDYKNYAINYNCDYHPDKKAHSIHAWILSKSKVLEGNTKEVVD
NVLKTFSHLIDASKFISNDFSEAACQYSTTYSLTGPDRH
>LACB_BOVIN
MKCLLLALALTCGAQALIVTQTMKGLDIQKVAGTWYSLAMAASDISLLDA
QSAPLRVYVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTKIPAVFKI
DALNENKVLVLDTDYKKYLLFCMENSAEPEQSLACQCLVRTPEVDDEALE
KFDKALKALPMHIRLSFNPTQLEEQCHI
  • We cannot compute a scalar product on a pair of
    variable-length, discrete strings.

37
Pairwise comparison kernel
38
Pairwise comparison kernel
39
Pairwise kernel variants
  • Smith-Waterman all-vs-all
  • BLAST all-vs-all
  • Smith-Waterman w.r.t. SCOP database
  • E-values from Pfam database

40
Protein-protein interactions
  • Pairwise interactions can be represented as a
    graph or a matrix.

41
Linear interaction kernel
1 0 0 1 0 1 0 1 1 0 1 0 1 1 0 1 0 0 0 0 1 1 0 0 0
0 1 0 1 1 0 1 0 0 1 0 1 0 0 1 1 0 0 0 0 0 0 1 0 0
1 0 1 0 0 0
3
  • The simplest kernel counts the number of
    interactions between each pair.
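
One way to read the linear interaction kernel is as the inner product of two proteins' binary interaction vectors, which counts the interaction partners they share. The sketch below follows that reading; the interpretation, the function name, and the four-protein matrix are assumptions made for illustration.

```python
def linear_interaction_kernel(adj):
    """K[i][j] = inner product of the interaction rows of proteins i and j."""
    n = len(adj)
    return [
        [sum(a * b for a, b in zip(adj[i], adj[j])) for j in range(n)]
        for i in range(n)
    ]


# Hypothetical symmetric interaction matrix for four proteins
# (1 = the two proteins interact, 0 = no known interaction).
adj = [
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 0],
]
K = linear_interaction_kernel(adj)
```

Here proteins 1 and 3 share two partners (proteins 0 and 2), so `K[1][3]` is 2, and each diagonal entry equals that protein's number of interactions.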

42
Diffusion kernel
  • A general method for establishing similarities
    between nodes of a graph.
  • Based upon a random walk.
  • Efficiently accounts for all paths connecting two
    nodes, weighted by path lengths.
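
A diffusion kernel on a small graph can be sketched as below. The slides do not give the formula, so this assumes the commonly used form K = exp(beta * H), where H is the negative graph Laplacian (adjacency minus degree). The value of `beta`, the truncated Taylor series used for the matrix exponential, and the four-node path graph are all choices made for this illustration and only suit small graphs.

```python
def mat_mul(P, Q):
    """Multiply two square matrices stored as lists of lists."""
    n = len(P)
    return [
        [sum(P[i][k] * Q[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def diffusion_kernel(adj, beta=0.5, terms=30):
    """Approximate exp(beta * H) with a Taylor series, H = adjacency - degree.

    Assumes adj is symmetric with a zero diagonal (no self-interactions).
    """
    n = len(adj)
    H = [
        [float(adj[i][j]) if i != j else -float(sum(adj[i])) for j in range(n)]
        for i in range(n)
    ]
    K = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in K]  # holds (beta * H)^k / k!
    for k in range(1, terms + 1):
        term = [[beta * v / k for v in row] for row in mat_mul(term, H)]
        K = [[K[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return K


# Hypothetical path graph: 0 - 1 - 2 - 3.
adj = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
K = diffusion_kernel(adj)
```

Unlike the linear kernel, nodes 0 and 3 receive a non-zero similarity even though they share no neighbor, because the exponential sums over paths of every length.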

43
Hydrophobicity profile
Membrane protein
Non-membrane protein
  • Transmembrane regions are typically hydrophobic,
    and vice versa.
  • The hydrophobicity profile of a membrane protein
    is evolutionarily conserved.

44
Hydrophobicity kernel
  • Generate hydropathy profile from amino acid
    sequence using Kyte-Doolittle index.
  • Prefilter the profiles.
  • Compare two profiles by computing the fast
    Fourier transform (FFT) of each and applying a
    Gaussian kernel function.
  • This kernel detects periodicities in the
    hydrophobicity profile.
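
The steps above can be sketched as follows. This is an illustration under several assumptions that the slides do not specify: the prefilter is a moving average, profiles are zero-padded or truncated to a fixed length, the spectrum is taken as DFT magnitudes (computed with a naive DFT, which yields the same coefficients as an FFT), and the Gaussian width `sigma` is arbitrary. Only the Kyte-Doolittle values are standard.

```python
import cmath
import math

# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq):
    return [KYTE_DOOLITTLE[aa] for aa in seq]


def smooth(profile, window=7):
    """Moving-average prefilter (the window size is an assumption)."""
    half = window // 2
    out = []
    for i in range(len(profile)):
        chunk = profile[max(0, i - half): i + half + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def spectrum(profile, length=64):
    """DFT magnitudes of the profile, zero-padded or truncated to `length`."""
    padded = (list(profile) + [0.0] * length)[:length]
    return [
        abs(sum(v * cmath.exp(-2j * math.pi * k * t / length)
                for t, v in enumerate(padded)))
        for k in range(length // 2 + 1)
    ]


def hydrophobicity_kernel(seq_a, seq_b, sigma=10.0):
    """Gaussian kernel between the two sequences' hydropathy spectra."""
    sa = spectrum(smooth(hydropathy_profile(seq_a)))
    sb = spectrum(smooth(hydropathy_profile(seq_b)))
    d2 = sum((a - b) ** 2 for a, b in zip(sa, sb))
    return math.exp(-d2 / (2.0 * sigma ** 2))


# First 50 residues of the two example sequences shown earlier in the deck.
icya = "GDIFYPGYCPDVKPVNDFDLSAFAGAWHEIAKLPLENENQGKCTIAEYKY"
lacb = "MKCLLLALALTCGAQALIVTQTMKGLDIQKVAGTWYSLAMAASDISLLDA"
k_self = hydrophobicity_kernel(icya, icya)
k_pair = hydrophobicity_kernel(icya, lacb)
```

Using magnitudes discards phase, so the comparison is sensitive to periodicity in the profile rather than to where along the sequence a hydrophobic stretch occurs.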

45
SVM learning from heterogeneous data
46
Combining kernels
  • Data sets A and B each yield a kernel, K(A) and
    K(B).
  • The kernel on the combined data, K(AB), is
    identical to K(A) + K(B).
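
For a linear kernel, the identity between combining the data and combining the kernels is easy to verify: concatenating two feature representations and taking inner products gives the same matrix as adding the two separate kernel matrices. The sketch below checks this on made-up data; the variable names and values are illustrative only.

```python
def linear_kernel(x, z):
    return sum(a * b for a, b in zip(x, z))


def gram(rows):
    """Kernel matrix of all pairwise linear-kernel values."""
    return [[linear_kernel(x, z) for z in rows] for x in rows]


# Two hypothetical feature representations of the same three proteins.
A = [[1.0, 2.0], [0.0, 1.0], [3.0, 1.0]]
B = [[0.5], [2.0], [1.0]]
AB = [a + b for a, b in zip(A, B)]  # concatenated feature vectors

K_A, K_B, K_AB = gram(A), gram(B), gram(AB)
K_sum = [[K_A[i][j] + K_B[i][j] for j in range(3)] for i in range(3)]
```

The kernel-sum form is the more general of the two, since it also applies when the data types have no natural vector representation to concatenate.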
47
Semidefinite programming
  • Define a convex cost function to assess the
    quality of a kernel matrix.
  • Semidefinite programming (SDP) optimizes convex
    cost functions over the convex cone of positive
    semidefinite matrices.

48
Semidefinite programming
  • Learn K from the convex cone of
    positive-semidefinite matrices, or a convex subset
    of it, according to a convex quality measure.
  • Integrate constructed kernels: learn a linear mix.
  • Large margin classifier (SVM): maximize the
    margin.
  • Together, these yield an SDP.
49
  • Integrate constructed kernels: learn a linear mix.
  • Large margin classifier (SVM): maximize the
    margin.
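
A "linear mix" of kernels is a weighted sum of kernel matrices. The sketch below shows only that combination step with fixed, hand-picked weights; in the approach described in the slides the weights are learned by the SDP, which is not implemented here. Restricting the weights to be non-negative is an assumption of this sketch that keeps the mix positive semi-definite.

```python
def combine_kernels(kernels, weights):
    """Return the weighted sum of equally sized kernel matrices."""
    if len(kernels) != len(weights):
        raise ValueError("need one weight per kernel")
    if any(w < 0 for w in weights):
        raise ValueError("this sketch assumes non-negative weights")
    n = len(kernels[0])
    return [
        [sum(w * K[i][j] for w, K in zip(weights, kernels)) for j in range(n)]
        for i in range(n)
    ]


# Two hypothetical 2x2 kernel matrices and hand-picked weights.
K_a = [[2.0, 1.0], [1.0, 2.0]]
K_b = [[1.0, 0.0], [0.0, 1.0]]
K_mix = combine_kernels([K_a, K_b], [0.5, 2.0])
```

The combined matrix can then be passed to any SVM implementation that accepts a precomputed kernel.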
50
Experimental results
51
Seven yeast kernels
Kernel  Data                  Similarity measure
KSW     protein sequence      Smith-Waterman
KB      protein sequence      BLAST
KHMM    protein sequence      Pfam HMM
KFFT    hydropathy profile    FFT
KLI     protein interactions  linear kernel
KD      protein interactions  diffusion kernel
KE      gene expression       radial basis kernel
52
Membrane proteins
53
Comparison of performance
Simple rules from hydrophobicity profile
TMHMM
54
Cytoplasmic ribosomal proteins
55
False negative predictions
56
False negative expression profiles
57
(No Transcript)
58
Markov Random Field
  • General Bayesian method, applied by Deng et al.
    to yeast functional classification.
  • Used five different types of data.
  • For their model, the input data must be binary.
  • Reported improved accuracy compared to using any
    single data type.

59
Yeast functional classes
Category                     Size
Metabolism                   1048
Energy                        242
Cell cycle DNA processing     600
Transcription                 753
Protein synthesis             335
Protein fate                  578
Cellular transport            479
Cell rescue, defense          264
Interaction w/ environment    193
Cell fate                     411
Cellular organization         192
Transport facilitation        306
Other classes                  81
60
Six types of data
  • Presence of Pfam domains.
  • Genetic interactions from CYGD.
  • Physical interactions from CYGD.
  • Protein-protein interaction by TAP.
  • mRNA expression profiles.
  • (Smith-Waterman scores).

61
Results
MRF
SDP/SVM (binary)
SDP/SVM (enriched)
62
Many yeast kernels
  • protein sequence
  • phylogenetic profiles
  • separate gene expression kernels
  • time series expression kernel
  • promoter regions using seven aligned species
  • protein localization
  • ChIP
  • protein-protein interactions
  • yeast knockout growth data
  • more ...

63
Future work
  • New kernel functions that incorporate domain
    knowledge.
  • Better understanding of the semantics of kernel
    weights.
  • Further investigation of yeast biology.
  • Improved scalability of the algorithm.
  • Prediction of protein-protein interactions.

64
Acknowledgments
  • Dave Anderson, University of Oregon
  • Wei Wu, Genome Sciences, UW and FHCRC
  • Michael Jordan, Statistics and EECS, UC Berkeley
  • Laurent El Ghaoui, EECS, UC Berkeley
  • Gert Lanckriet, EECS, UC Berkeley
  • Nello Cristianini, Statistics, UC Davis

65
Fisher criterion score
Low score
High score
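
The slide does not define the score, so the sketch below uses the usual form of the Fisher criterion for a single feature: the squared difference between the class means divided by the sum of the class variances. The use of population variance and the toy values are assumptions for illustration.

```python
from statistics import mean, pvariance


def fisher_score(pos, neg):
    """(mean_pos - mean_neg)^2 / (var_pos + var_neg) for one feature.

    Raises ZeroDivisionError if the feature is constant within both classes.
    """
    return (mean(pos) - mean(neg)) ** 2 / (pvariance(pos) + pvariance(neg))


# Made-up values of two features for positive and negative examples.
well_separated = fisher_score([2.0, 3.0, 4.0], [0.0, 1.0, 2.0])
overlapping = fisher_score([1.0, 2.0, 3.0], [1.5, 2.5, 3.5])
```

A feature whose class distributions sit far apart relative to their spread gets a high score, and a feature whose distributions overlap gets a low one, which is what allows features to be ranked.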
66
Feature ranking
Feature                   Score
delta Cn                  2.861
match total ion current   2.804
Cn                        2.444
match peaks               2.314
Sp                        1.158
mass                      0.704
charge                    0.488
rank Sp                   0.313
peak count                0.209
sequence similarity       0.115
ion match                 0.079
total ion current         0.026
delta mass                0.024
67
Pairwise feature ranking
Feature pair               Score
match TIC - delta Cn       4.741
match peaks - delta Cn     4.233
match TIC - Cn             3.819
delta Cn - Cn              3.597
delta Cn - charge          3.563
match peaks - Cn           3.377
delta Cn - mass            3.119
match TIC - match peaks    2.823
ion match - delta Cn       2.812
Sp - delta Cn              2.799
match TIC - Sp             2.579
match peaks - Sp           2.383
match TIC - mass           2.097
ion match - mass           2.091
Cn - charge                1.943
Sp - mass                  1.922
match TIC - charge         1.898
Cn - mass                  1.884
Sp - Cn                    1.881
ion match - Cn             1.827
Sp - charge                1.770
match peaks - mass         1.668
match peaks - charge       1.528
match TIC - ion match      1.473