talk in bioitworld2002 - PowerPoint PPT Presentation

About This Presentation
Title:

talk in bioitworld2002

Description:

Limsoon Wong. Laboratories for Information Technology. Singapore. From Datamining ... Stop codon. Codon bias. Signal Integration. kNN ... – PowerPoint PPT presentation

Slides: 40
Provided by: Limsoo

Transcript and Presenter's Notes

1
From Datamining to Bioinformatics
Limsoon Wong Laboratories for Information
Technology Singapore
2
What is Bioinformatics?
3
Themes of Bioinformatics
Bioinformatics = Data Mgmt + Knowledge Discovery
Data Mgmt = Integration + Transformation + Cleansing
Knowledge Discovery = Statistics + Algorithms + Databases
4
Benefits of Bioinformatics
To the patient: Better drug, better treatment
To the pharma: Save time, save cost, make more
To the scientist: Better science
5
From Informatics to Bioinformatics
8 years of bioinformatics R&D in Singapore
  • MHC-Peptide Binding (PREDICT)
  • Protein Interactions Extraction (PIES)
  • Gene Expression & Medical Record Datamining (PCL)
  • Cleansing & Warehousing (FIMM)
  • Gene Feature Recognition (Dragon)
  • Integration Technology (Kleisli)
  • Venom Informatics

Timeline: 1994, 1996, 1998, 2000, 2002
Organisations: ISS, KRDL, LIT
6
Quick Samplings
7
Epitope Prediction
TRAP-559AA MNHLGNVKYLVIVFLIFFDLFLVNGRDVQNNIVDEIKYS
E EVCNDQVDLYLLMDCSGSIRRHNWVNHAVPLAMKLIQQLN LNDNAIH
LYVNVFSNNAKEIIRLHSDASKNKEKALIIIRS LLSTNLPYGRTNLTDA
LLQVRKHLNDRINRENANQLVVIL TDGIPDSIQDSLKESRKLSDRGVKI
AVFGIGQGINVAFNR FLVGCHPSDGKCNLYADSAWENVKNVIGPFMKAV
CVEVEK TASCGVWDEWSPCSVTCGKGTRSRKREILHEGCTSEIQEQ CE
EERCPPKWEPLDVPDEPEDDQPRPRGDNSSVQKPEENI IDNNPQEPSPN
PEEGKDENPNGFDLDENPENPPNPDIPEQ KPNIPEDSEKEVPSDVPKNP
EDDREENFDIPKKPENKHDN QNNLPNDKSDRNIPYSPLPPKVLDNERKQ
SDPQSQDNNGN RHVPNSEDRETRPHGRNNENRSYNRKYNDTPKHPEREE
HE KPDNNKKKGESDNKYKIAGGIAGGLALLACAGLAYKFVVP GAATPY
AGEPAPFDETLGEEDKDLDEPEQFRLPEENEWN
8
Epitope Prediction Results
  • Prediction by our ANN model for HLA-A11
  • 29 predictions
  • 22 epitopes
  • 76% specificity
  • Prediction by BIMAS matrix for HLA-A1101

Number of experimental binders, by BIMAS rank: 19 (52.8%), 5 (13.9%), 12 (33.3%)
9
Transcription Start Prediction
10
Transcription Start Prediction Results
11
Medical Record Analysis
  • Looking for patterns that are
  • valid
  • novel
  • useful
  • understandable

12
Gene Expression Analysis
  • Classifying gene expression profiles
  • find stable differentially expressed genes
  • find significant gene groups
  • derive coordinated gene expression

13
Medical Record & Gene Expression Analysis Results
  • PCL, a novel emerging pattern method
  • Beats C4.5, CBA, LB, NB, TAN in 21 out of 32 UCI
    benchmarks
  • Works well for gene expressions

Cancer Cell, March 2002, 1(2)
14
Behind the Scene
  • Allen Chong
  • Judice Koh
  • SPT Krishnan
  • Huiqing Liu
  • Seng Hong Seah
  • Soon Heng Tan
  • Guanglan Zhang
  • Zhuo Zhang
  • Vladimir Bajic
  • Vladimir Brusic
  • Jinyan Li
  • See-Kiong Ng
  • Limsoon Wong
  • Louxin Zhang

and many more students, folks from
geneticXchange, MolecularConnections, and other
collaborators.
15
Questions?
16
A More Detailed Account
17
What is Datamining?
Jonathan's rules: Blue or Circle
Jessica's rules: All the rest
18
What is Datamining?
Question: Can you explain how?
19
The Steps of Data Mining
  • Training data gathering
  • Signal generation
  • k-grams, colour, texture, domain know-how, ...
  • Signal selection
  • Entropy, χ², CFS, t-test, domain know-how, ...
  • Signal integration
  • SVM, ANN, PCL, CART, C4.5, kNN, ...

20
Translation Initiation Recognition
21
A Sample cDNA
299 HSU27655.1 CAT U27655 Homo sapiens
CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG   80
CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA  160
GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA  240
CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT
............................................................   80
................................iEEEEEEEEEEEEEEEEEEEEEEEEEEE  160
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE  240
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
What makes the second ATG the translation
initiation site?
22
Signal Generation
  • k-grams (i.e., k consecutive letters)
  • k = 1, 2, 3, 4, 5, ...
  • Window size vs. fixed position
  • Upstream, downstream vs. anywhere in window
  • In-frame vs. any frame
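
The choices listed above can be sketched in code. This is only an illustration, not the method used in the talk: the helper name `kgram_counts`, its parameters, and the 30-base window are all assumptions made for the example.

```python
from collections import Counter

def kgram_counts(seq, atg_pos, k, region="any", in_frame=False, window=30):
    """Count k-grams near a candidate ATG starting at 0-based index atg_pos.

    region: "up" (bases before the ATG), "down" (bases after it),
            or "any" (the whole window, ATG included).
    in_frame: keep only k-grams whose start is a multiple of 3 away from the ATG.
    window: hypothetical window size, in bases, on each side of the ATG.
    """
    up_lo, up_hi = max(0, atg_pos - window), atg_pos
    down_lo, down_hi = atg_pos + 3, min(len(seq), atg_pos + 3 + window)
    bounds = {"up": (up_lo, up_hi), "down": (down_lo, down_hi), "any": (up_lo, down_hi)}
    lo, hi = bounds[region]
    counts = Counter()
    for i in range(lo, hi - k + 1):
        if in_frame and (i - atg_pos) % 3 != 0:
            continue  # skip k-grams that are out of frame with the ATG
        counts[seq[i:i + k]] += 1
    return counts

# Toy sequence (made up for the example) with an ATG at index 3.
seq = "CCCATGGCTGAA"
down_codons = kgram_counts(seq, atg_pos=3, k=3, region="down", in_frame=True)
up_bases = kgram_counts(seq, atg_pos=3, k=1, region="up")
```

Each (k-gram, region, frame) combination then becomes one numeric feature of the candidate ATG.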

23
Too Many Signals
  • For each value of k, there are
  • 4^k × 3 × 2 k-grams
  • If we use k = 1, 2, 3, 4, 5, we have
  • 4 + 24 + 96 + 384 + 1536 + 6144 = 8188
  • features!
  • This is too many for most machine learning
    algorithms
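
The arithmetic can be checked with a few lines of Python. The reading of the factors is an assumption: 4^k possible k-grams over the DNA alphabet, 3 positional choices (upstream, downstream, anywhere) and 2 frame choices (in-frame, any frame).

```python
# Number of k-gram features per k, assuming 4**k k-grams x 3 regions x 2 frame options.
ks = [1, 2, 3, 4, 5]
per_k = {k: 4**k * 3 * 2 for k in ks}
total = sum(per_k.values())
print(per_k, total)
```

Under this reading the five terms sum to 8184. The slide's total of 8188 includes an extra leading term of 4, whose origin is not clear from the transcript.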

24
Signal Selection (Basic Idea)
  • Choose a signal w/ low intra-class distance
  • Choose a signal w/ high inter-class distance
  • Which of the following 3 signals is good?

25
Signal Selection (e.g., t-statistics)
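
The formula on this slide did not survive in the transcript. As a stand-in, here is a minimal sketch of one standard form, the Welch two-sample t-statistic, applied to a single signal. The function name `t_statistic` and the toy values are assumptions for illustration.

```python
from math import sqrt
from statistics import mean, variance

def t_statistic(class1, class2):
    """Welch two-sample t-statistic for one signal's values in two classes."""
    se = sqrt(variance(class1) / len(class1) + variance(class2) / len(class2))
    return (mean(class1) - mean(class2)) / se

# Toy values: a signal that is low in class 1 and high in class 2.
t = t_statistic([1, 2, 3], [4, 5, 6])
print(round(t, 3))
```

A large |t| means the class means are far apart relative to the spread within each class, which matches the "low intra-class distance, high inter-class distance" idea.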
26
Signal Selection (e.g., MIT-correlation)
27
Signal Selection (e.g., χ²)
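
The slide's own formula is missing from the transcript. A minimal sketch of the usual Pearson χ² statistic for a discretised signal against the class label follows. The helper name `chi_square` and the toy data are assumptions.

```python
from collections import Counter

def chi_square(values, labels):
    """Pearson chi-square statistic between a discrete signal and the class label."""
    n = len(values)
    observed = Counter(zip(values, labels))
    value_totals = Counter(values)
    class_totals = Counter(labels)
    stat = 0.0
    for v in value_totals:
        for c in class_totals:
            expected = value_totals[v] * class_totals[c] / n
            stat += (observed[(v, c)] - expected) ** 2 / expected
    return stat

# A signal that tracks the class perfectly, and one that is independent of it.
informative = chi_square([1, 1, 0, 0], ["C1", "C1", "C2", "C2"])
useless = chi_square([1, 0, 1, 0], ["C1", "C1", "C2", "C2"])
```

Signals with a higher statistic deviate more from what independence between signal and class would predict.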
28
Signal Selection (e.g., CFS)
  • Instead of scoring individual signals, how about
    scoring a group of signals as a whole?
  • CFS
  • A good group contains signals that are highly
    correlated with the class, and yet uncorrelated
    with each other
  • Homework: find a formula that captures the key
    idea of CFS above
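
For reference, one commonly cited answer is the merit heuristic from Hall's correlation-based feature selection. It is given here as a pointer, not as the formula the speaker had in mind. The symbols are k (the number of signals in the group S), the mean signal-class correlation, and the mean signal-signal correlation.

```latex
\[
  \mathrm{Merit}_S \;=\; \frac{k\,\overline{r_{cf}}}{\sqrt{\,k + k(k-1)\,\overline{r_{ff}}\,}}
\]
```

The numerator grows when the signals correlate with the class, and the denominator grows when they correlate with each other, so redundant groups score lower.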

29
Sample k-grams Selected
  • Position -3 (Kozak consensus)
  • in-frame upstream ATG (leaky scanning)
  • in-frame downstream
  • TAA, TAG, TGA (stop codon)
  • CTG, GAC, GAG, and GCC (codon bias)

30
Signal Integration
  • kNN
  • Given a test sample, find the k training samples
    that are most similar to it. Let the majority
    class win.
  • SVM
  • Given a group of training samples from two
    classes, determine a separating plane that
    maximises the margin between the classes.
  • Naïve Bayes, ANN, C4.5, ...
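
The kNN rule described above fits in a few lines. This is a minimal sketch, assuming numeric feature vectors and Euclidean distance. The slide does not say which similarity measure was used, and the name `knn_predict` is hypothetical.

```python
from collections import Counter
from math import dist

def knn_predict(train, test_point, k=3):
    """train is a list of (feature_vector, label) pairs; returns the majority label."""
    neighbours = sorted(train, key=lambda item: dist(item[0], test_point))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Toy training set with two well-separated classes.
train = [((0, 0), "non-TIS"), ((0, 1), "non-TIS"), ((1, 0), "non-TIS"),
         ((5, 5), "TIS"), ((5, 6), "TIS"), ((6, 5), "TIS")]
print(knn_predict(train, (4.5, 5)))
```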

31
Results (on Pedersen & Nielsen's mRNA)
32
Acknowledgements
  • Roland Yap
  • Zeng Fanfan
  • A.G. Pedersen
  • H. Nielsen

33
Questions?
34
Common Mistakes
35
Self-fulfilling Oracle
  • Consider this scenario
  • Given classes C1 and C2 w/ explicit signals
  • Apply χ² to C1 and C2 to select signals s1, s2, s3
  • Run 3-fold x-validation on C1 and C2 using s1,
    s2, s3 and get accuracy of 90%
  • Is the accuracy really 90%?
  • What can be wrong with this?

36
Phil Long's Experiment
  • Let there be classes C1 and C2 w/ 100000 features
    having randomly generated values
  • Use χ² to select 20 features
  • Run k-fold x-validation on C1 and C2 w/ these 20
    features
  • Expect 50% accuracy
  • Get 90% accuracy!
  • Lesson: choose features at each fold
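
The effect is easy to reproduce on a small scale. This sketch is not the original experiment: it uses 2000 random features instead of 100000, ranks features by the absolute difference of class means instead of χ², and classifies with a nearest-centroid rule. All names in it are made up for the illustration.

```python
import random

def avg(xs):
    return sum(xs) / len(xs)

def select_features(rows, labels, idx, n_keep):
    """Keep the n_keep features with the largest |class-mean difference| on samples idx."""
    def score(j):
        a = [rows[i][j] for i in idx if labels[i] == 0]
        b = [rows[i][j] for i in idx if labels[i] == 1]
        return abs(avg(a) - avg(b))
    return sorted(range(len(rows[0])), key=score, reverse=True)[:n_keep]

def predict(rows, labels, train_idx, test_i, feats):
    """Nearest-centroid prediction for sample test_i, using only features feats."""
    best_label, best_dist = None, float("inf")
    for c in (0, 1):
        members = [i for i in train_idx if labels[i] == c]
        centroid = [avg([rows[i][j] for i in members]) for j in feats]
        d = sum((rows[test_i][j] - m) ** 2 for j, m in zip(feats, centroid))
        if d < best_dist:
            best_label, best_dist = c, d
    return best_label

def cv_accuracy(rows, labels, n_folds, n_keep, select_inside_fold):
    all_idx = list(range(len(rows)))
    # Selecting on all samples lets the test folds leak into feature selection.
    global_feats = None if select_inside_fold else select_features(rows, labels, all_idx, n_keep)
    correct = 0
    for fold in range(n_folds):
        test_idx = [i for i in all_idx if i % n_folds == fold]
        train_idx = [i for i in all_idx if i % n_folds != fold]
        feats = (select_features(rows, labels, train_idx, n_keep)
                 if select_inside_fold else global_feats)
        correct += sum(predict(rows, labels, train_idx, i, feats) == labels[i]
                       for i in test_idx)
    return correct / len(rows)

rng = random.Random(0)
n_samples, n_features = 40, 2000  # scaled down from the slide's 100000 features
labels = [i % 2 for i in range(n_samples)]  # two balanced classes
rows = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n_samples)]  # pure noise

biased = cv_accuracy(rows, labels, n_folds=5, n_keep=20, select_inside_fold=False)
honest = cv_accuracy(rows, labels, n_folds=5, n_keep=20, select_inside_fold=True)
print(f"selection outside folds: {biased:.2f}, inside folds: {honest:.2f}")
```

Because the features are noise, any accuracy well above chance in the first figure comes from selecting features on the full data before cross-validation. Selecting inside each fold removes that leak.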

37
Apples vs Oranges
  • Consider this scenario
  • Fanfan reported 89% accuracy on his TIS
    prediction method
  • Hatzigeorgiou reported 94% accuracy on her TIS
    prediction method
  • So Hatzigeorgiou's method is better
  • What is wrong with this conclusion?

38
Apples vs Oranges
  • Differences in datasets used
  • Fanfan's expt used Pedersen's dataset
  • Hatzigeorgiou used her own dataset
  • Differences in counting
  • Fanfan's expt was on a per-ATG basis
  • Hatzigeorgiou's expt used the scanning rule and
    thus was on a per-cDNA basis
  • When Fanfan ran on the same dataset and counted the
    same way as Hatzigeorgiou, he got 94% also!

39
Questions?