Regression based KNN for gene function prediction using heterogeneous data sources


1
Regression based KNN for gene function prediction
using heterogeneous data sources
  • Zizhen Yao, Larry Ruzzo
  • yzizhen, ruzzo _at_cs.washington.edu

2
Background
  • E. coli classification schemes
  • KEGG, COG, MultiFun
  • Common functional classes (10-19 classes)
  • Metabolism, Translation, Transporter, Cell
    Motility
  • Biological information used for inference
  • Microarray expression, protein interaction,
    evolutionary history
  • Methods
  • Support vector machine, Bayesian, Rule-based

3
Introduction to KNN
  • Idea: for each query instance
  • Choose the k nearest neighbors
  • Choose the class voted for by the majority of the
    neighbors.
  • Design issues
  • Similarity / Distance metric
  • Voting schemes
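
The basic KNN idea above can be sketched in a few lines. This is a minimal illustration of plain majority voting, not the regression-based variant the talk develops; the function name `knn_majority_vote` and the toy similarity values are assumptions made for the example.

```python
from collections import Counter

def knn_majority_vote(similarities, labels, k):
    """Return the majority class among the k most similar training instances.

    similarities: similarity of the query to each training instance
                  (higher means closer); hypothetical values in this sketch.
    labels:       class label of each training instance, in the same order.
    """
    # Rank training instances from most to least similar to the query.
    ranked = sorted(range(len(labels)), key=lambda i: similarities[i], reverse=True)
    top_k = ranked[:k]
    votes = Counter(labels[i] for i in top_k)
    # On a tie, most_common returns the class that was counted first,
    # which here is the class of the closer neighbor.
    return votes.most_common(1)[0][0]

# Toy query: similarity to five training genes and their classes.
sims = [0.9, 0.2, 0.8, 0.7, 0.1]
labels = ["Metabolism", "Translation", "Transporter", "Metabolism", "Translation"]
prediction = knn_majority_vote(sims, labels, k=3)
```

The two design issues listed on the slide map directly onto this sketch: how `similarities` is computed, and what replaces the simple `Counter` vote.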

4
Algorithm Flow Chart
  • Training (input: training data)
  • For every pair of training genes, calculate the
    predictors.
  • Learn the similarity metric.
  • Testing (input: testing data)
  • Calculate the predictor values using the testing and
    training data.
  • Choose the k nearest neighbors.
  • Voting
  • Output: a list of predictions with confidence scores.

5
Predictors
  • Microarray Expression Data
  • Expression correlation
  • Sequencing Data
  • Chromosomal position
  • Chromosomal distance
  • Transcription direction
  • Block indicator
  • Protein sequence similarity
  • Paralog indicator
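
As a rough illustration of how some of these pairwise predictors might be computed, the sketch below covers expression correlation, chromosomal distance, and transcription direction. The slides do not define the predictors precisely, so the details are assumptions: Pearson correlation for expression, a wrap-around distance because the E. coli chromosome is circular, and the hypothetical field names `expr`, `pos`, and `strand`. The block indicator, sequence similarity, and paralog indicator are not covered.

```python
import math

def expression_correlation(x, y):
    """Pearson correlation between two expression profiles of equal length.

    Assumes neither profile is constant (otherwise the denominator is zero).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def chromosomal_distance(pos_a, pos_b, genome_length):
    """Shortest distance between two positions on a circular chromosome."""
    d = abs(pos_a - pos_b) % genome_length
    return min(d, genome_length - d)

def same_direction(strand_a, strand_b):
    """1 if both genes are transcribed from the same strand, else 0."""
    return 1 if strand_a == strand_b else 0

def pair_predictors(gene_a, gene_b, genome_length):
    """Predictor vector for one pair of genes (hypothetical record layout)."""
    return [
        expression_correlation(gene_a["expr"], gene_b["expr"]),
        chromosomal_distance(gene_a["pos"], gene_b["pos"], genome_length),
        same_direction(gene_a["strand"], gene_b["strand"]),
    ]

# Toy genes; positions and genome length are illustrative values only.
g1 = {"expr": [1.0, 2.0, 3.0, 4.0], "pos": 100, "strand": "+"}
g2 = {"expr": [2.0, 4.0, 6.0, 8.0], "pos": 4_600_000, "strand": "-"}
features = pair_predictors(g1, g2, genome_length=4_640_000)
```

Note that the three values live on very different scales (a correlation, a base-pair count, a binary flag), which is the heterogeneity the similarity metric has to deal with.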

6
Similarity (Distance) Metric
  • Classical metrics are not appropriate because the
    predictors are
  • heterogeneous in data type and scale
  • of different relevance
  • correlated
  • Goal: estimate the likelihood that a pair of
    genes is in the same class, based on the predictors

7
Learning Similarity Metric
  • Regression methods
  • Response
  • Find f
  • Logistic regression
  • Local regression
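
The formulas for the response and for f did not survive in this transcript, so the following is only a sketch under stated assumptions: the response for a pair of training genes is taken to be 1 when the two genes share a class and 0 otherwise, and f is fit by logistic regression using plain batch gradient descent. The helper names, the `cls` and `v` fields, and the single toy predictor are all hypothetical.

```python
import math
from itertools import combinations

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def build_pair_dataset(genes, predictor_fn):
    """One row per unordered pair of training genes.

    Assumed response: 1 if the two genes are in the same class, else 0.
    """
    X, y = [], []
    for a, b in combinations(genes, 2):
        X.append(predictor_fn(a, b))
        y.append(1 if a["cls"] == b["cls"] else 0)
    return X, y

def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Batch gradient descent on the mean logistic loss; returns (weights, bias)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def similarity(w, b, predictors):
    """Learned similarity: estimated probability that a pair shares a class."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, predictors)))

# Toy training set with a single made-up predictor per pair.
toy_genes = [
    {"v": 0.1, "cls": "A"}, {"v": 0.2, "cls": "A"},
    {"v": 0.9, "cls": "B"}, {"v": 1.0, "cls": "B"},
]
toy_predictors = lambda a, b: [1.0 - abs(a["v"] - b["v"])]

X, y = build_pair_dataset(toy_genes, toy_predictors)
w, b = fit_logistic(X, y)
```

Because the fitted f outputs a value between 0 and 1 for any predictor vector, it can serve directly as the similarity used to pick neighbors, regardless of the scale or type of the individual predictors.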

8
Probabilistic voting scheme
  • Goal: estimate the probability that the query
    gene belongs to each class.
  • Range: 0 to 1
  • Assigns higher confidence scores to predictions
    voted for by more neighbors, or by neighbors with
    higher credibility.
  • Report predictions that are above a certain
    threshold value.
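
The slide does not give the voting formula, so the sketch below swaps in one simple scheme that has the listed properties: a noisy-OR combination. It assumes each neighbor's similarity can be read as the probability that the query shares that neighbor's class, and that neighbors vote independently. The function name `probabilistic_vote` and the default threshold are hypothetical.

```python
def probabilistic_vote(neighbors, threshold=0.5):
    """Score each class by noisy-OR over the neighbors that belong to it.

    neighbors: list of (class_label, similarity) pairs for the k nearest
               neighbors, with similarity in [0, 1].
    Returns (class_label, confidence) pairs at or above the threshold,
    sorted from most to least confident.
    """
    # miss[c] = probability that no neighbor of class c "is right".
    miss = {}
    for label, sim in neighbors:
        miss[label] = miss.get(label, 1.0) * (1.0 - sim)
    scores = {label: 1.0 - m for label, m in miss.items()}
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [(label, score) for label, score in ranked if score >= threshold]

# Toy neighbors of a query gene.
neighbors = [("Metabolism", 0.6), ("Metabolism", 0.5), ("Translation", 0.3)]
predictions = probabilistic_vote(neighbors, threshold=0.5)
```

With this scheme a score stays between 0 and 1, grows when more neighbors vote for the class, and grows when those neighbors have higher similarity, which matches the behavior described on the slide.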

9
Performance comparison

10
Functional Classes ROC analysis (KEGG)

11
Confidence Score vs. Accuracy

12
Results Summary
  • Combining all 4 predictors yields the best
    result.
  • Using expression data only, regression based KNN
    methods outperform SVM.
  • Performance varies across functional
    classes.
  • Confidence scores are strongly correlated with
    accuracy.
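
One way to check a claim like the last one is to bin predictions by confidence score and compare the accuracy in each bin. The sketch below is a generic illustration of that check, not the analysis behind the talk's figure; the function name and the toy predictions are made up for the example.

```python
def accuracy_by_confidence(predictions, n_bins=5):
    """Group predictions into equal-width confidence bins.

    predictions: list of (confidence in [0, 1], is_correct) pairs.
    Returns {bin_index: (count, accuracy)} for the non-empty bins.
    """
    bins = {}
    for conf, correct in predictions:
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        hits, total = bins.get(idx, (0, 0))
        bins[idx] = (hits + int(correct), total + 1)
    return {idx: (total, hits / total) for idx, (hits, total) in sorted(bins.items())}

# Hypothetical predictions: (confidence score, whether the prediction was right).
toy = [(0.95, True), (0.9, True), (0.55, True), (0.5, False), (0.15, False)]
table = accuracy_by_confidence(toy, n_bins=5)
```

If confidence tracks accuracy, the per-bin accuracy should rise with the bin index, as it does in this toy table.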

13
Contribution
  • KNN
  • Simplicity, efficiency, flexibility
  • Easy to interpret the results, useful to guide
    case studies
  • Similarity metric
  • Integrates heterogeneous data sources
  • Voting scheme
  • Statistical inference
  • A general framework to incorporate other
    information.