talk in bioitworld2002 - PowerPoint PPT Presentation

About This Presentation

Title:

talk in bioitworld2002

Description:

Limsoon Wong. Laboratories for Information Technology. Singapore. From Datamining ... Stop codon. Codon bias. Signal Integration. kNN ... – PowerPoint PPT presentation

1
From Datamining to Bioinformatics
Limsoon Wong Laboratories for Information
Technology Singapore
2
What is Bioinformatics?
3
Themes of Bioinformatics
Bioinformatics = Data Mgmt + Knowledge Discovery
Data Mgmt = Integration + Transformation + Cleansing
Knowledge Discovery = Statistics + Algorithms + Databases
4
Benefits of Bioinformatics
To the patient: Better drug, better treatment
To the pharma: Save time, save cost, make more
To the scientist: Better science
5
From Informatics to Bioinformatics
8 years of bioinformatics R&D in Singapore
Projects:
MHC-Peptide Binding (PREDICT)
Protein Interactions Extraction (PIES)
Gene Expression & Medical Record Datamining (PCL)
Cleansing & Warehousing (FIMM)
Gene Feature Recognition (Dragon)
Integration Technology (Kleisli)
Venom Informatics
Timeline axis: 1994, 1996, 1998, 2000, 2002
Organisations: ISS, KRDL, LIT
6
Quick Samplings
7
Epitope Prediction
TRAP-559AA MNHLGNVKYLVIVFLIFFDLFLVNGRDVQNNIVDEIKYS
E EVCNDQVDLYLLMDCSGSIRRHNWVNHAVPLAMKLIQQLN LNDNAIH
LYVNVFSNNAKEIIRLHSDASKNKEKALIIIRS LLSTNLPYGRTNLTDA
LLQVRKHLNDRINRENANQLVVIL TDGIPDSIQDSLKESRKLSDRGVKI
AVFGIGQGINVAFNR FLVGCHPSDGKCNLYADSAWENVKNVIGPFMKAV
CVEVEK TASCGVWDEWSPCSVTCGKGTRSRKREILHEGCTSEIQEQ CE
EERCPPKWEPLDVPDEPEDDQPRPRGDNSSVQKPEENI IDNNPQEPSPN
PEEGKDENPNGFDLDENPENPPNPDIPEQ KPNIPEDSEKEVPSDVPKNP
EDDREENFDIPKKPENKHDN QNNLPNDKSDRNIPYSPLPPKVLDNERKQ
SDPQSQDNNGN RHVPNSEDRETRPHGRNNENRSYNRKYNDTPKHPEREE
HE KPDNNKKKGESDNKYKIAGGIAGGLALLACAGLAYKFVVP GAATPY
AGEPAPFDETLGEEDKDLDEPEQFRLPEENEWN
8
Epitope Prediction Results

Prediction by our ANN model for HLA-A11
29 predictions
22 epitopes
76% specificity

Prediction by BIMAS matrix for HLA-A1101

Number of experimental binders, grouped by BIMAS rank: 19 (52.8%), 5 (13.9%), 12 (33.3%)
9
Transcription Start Prediction
10
Transcription Start Prediction Results
11
Medical Record Analysis

Looking for patterns that are
valid
novel
useful
understandable

12
Gene Expression Analysis

Classifying gene expression profiles
find stable differentially expressed genes
find significant gene groups
derive coordinated gene expression

13
Medical Record & Gene Expression Analysis Results

PCL, a novel emerging pattern method
Beats C4.5, CBA, LB, NB, TAN in 21 out of 32 UCI
benchmarks
Works well for gene expressions

Cancer Cell, March 2002, 1(2)
14
Behind the Scene

Allen Chong
Judice Koh
SPT Krishnan
Huiqing Liu
Seng Hong Seah
Soon Heng Tan
Guanglan Zhang
Zhuo Zhang

Vladimir Bajic
Vladimir Brusic
Jinyan Li
See-Kiong Ng
Limsoon Wong
Louxin Zhang

and many more students, folks from
geneticXchange, MolecularConnections, and other
collaborators.
15
Questions?
16
A More Detailed Account
17
What is Datamining?
Jonathan's rules: Blue or Circle
Jessica's rules: All the rest
18
What is Datamining?
Question: Can you explain how?
19
The Steps of Data Mining

Training data gathering
Signal generation
k-grams, colour, texture, domain know-how, ...
Signal selection
Entropy, χ², CFS, t-test, domain know-how, ...
Signal integration
SVM, ANN, PCL, CART, C4.5, kNN, ...

20
Translation Initiation Recognition
21
A Sample cDNA
299 HSU27655.1 CAT U27655 Homo sapiens
CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG 80
CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA 160
GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA 240
CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT
............................................................ 80
................................iEEEEEEEEEEEEEEEEEEEEEEEEEEE 160
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE 240
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
What makes the second ATG the translation initiation site?
22
Signal Generation

K-grams (i.e., k consecutive letters)
k = 1, 2, 3, 4, 5, ...
Window size vs. fixed position
Upstream, downstream vs. anywhere in window
In-frame vs. any frame
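A minimal sketch of how such k-gram signals could be generated around a candidate ATG. The function name, the 99-base window, and the choice to count occurrences (rather than record presence) are illustrative assumptions, not details from the talk.

```python
from collections import Counter

def kgram_counts(seq, atg_pos, k=3, window=99, in_frame=True):
    """Count k-grams upstream and downstream of a candidate ATG.

    atg_pos is the 0-based index of the 'A' of the candidate ATG.
    With in_frame=True, only k-grams starting a multiple of 3 bases
    away from the ATG are counted; otherwise every offset is used.
    The window size is a hypothetical default.
    """
    step = 3 if in_frame else 1
    upstream, downstream = Counter(), Counter()

    # Walk left from the ATG, one step at a time, within the window.
    start = atg_pos - step
    while start >= max(0, atg_pos - window):
        if start + k <= len(seq):
            upstream[seq[start:start + k]] += 1
        start -= step

    # Walk right, beginning just after the ATG itself.
    start = atg_pos + 3
    while start + k <= min(len(seq), atg_pos + 3 + window):
        downstream[seq[start:start + k]] += 1
        start += step

    return upstream, downstream

# Toy sequence: the candidate is the second ATG (index 6).
up, down = kgram_counts("ATGCCCATGGCTTAA", atg_pos=6)
```

Here `up` records an in-frame upstream ATG and `down` records an in-frame downstream TAA, the kind of signals the later slides report as informative.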

23
Too Many Signals

For each value of k, there are 4^k × 3 × 2 k-grams
If we use k = 1, 2, 3, 4, 5, we have
24 + 96 + 384 + 1536 + 6144 = 8184 features!
This is too many for most machine learning algorithms
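The count can be checked by evaluating the per-k formula directly. In this sketch the factor 3 is read as the three location options (upstream, downstream, anywhere in window) and the factor 2 as the two frame options (in-frame, any frame); that reading is an assumption based on the previous slide.

```python
def num_kgram_features(k, locations=3, frames=2):
    """Number of distinct k-gram signals over the 4-letter DNA alphabet."""
    return 4 ** k * locations * frames

counts = [num_kgram_features(k) for k in range(1, 6)]
total = sum(counts)
print(counts, total)  # [24, 96, 384, 1536, 6144] 8184
```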

24
Signal Selection (Basic Idea)

Choose a signal w/ low intra-class distance
Choose a signal w/ high inter-class distance
Which of the following 3 signals is good?

25
Signal Selection (e.g., t-statistics)
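As a rough illustration of scoring a single signal with a t-statistic, the sketch below uses the unequal-variance (Welch) form. The talk does not say which variant was used, so this choice is an assumption, and the sample values are made up.

```python
from math import sqrt
from statistics import mean, variance

def t_score(class1, class2):
    """Absolute t-statistic of one signal's values in two classes.

    A large score means the class means are far apart relative to
    the spread within each class.
    """
    m1, m2 = mean(class1), mean(class2)
    v1, v2 = variance(class1), variance(class2)  # sample variances
    return abs(m1 - m2) / sqrt(v1 / len(class1) + v2 / len(class2))

good = t_score([1, 2, 3], [7, 8, 9])    # well-separated classes
poor = t_score([1, 5, 9], [2, 6, 10])   # heavily overlapping classes
```

The first signal scores far higher than the second, matching the basic idea of low intra-class and high inter-class distance.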
26
Signal Selection (e.g., MIT-correlation)
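The MIT correlation (the signal-to-noise score used in the MIT leukemia study) is commonly written as (μ1 − μ2) / (σ1 + σ2). A minimal sketch, assuming sample standard deviations and made-up values:

```python
from statistics import mean, stdev

def mit_correlation(class1, class2):
    """Signal-to-noise score: difference of class means divided by
    the sum of the class standard deviations."""
    return (mean(class1) - mean(class2)) / (stdev(class1) + stdev(class2))

score = mit_correlation([1, 2, 3], [7, 8, 9])  # (2 - 8) / (1 + 1)
```

The sign indicates which class has the higher mean; signals are usually ranked by absolute value.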
27
Signal Selection (e.g., χ²)
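For a discretised signal, χ² compares the observed class counts in each signal interval with the counts expected if signal and class were independent. A minimal sketch over a contingency table (rows are signal intervals, columns are classes); it assumes every row and column total is non-zero, and the tables are made up.

```python
def chi_square(table):
    """Pearson chi-square statistic of a contingency table.

    table[i][j] = number of samples in signal interval i and class j.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

perfect = chi_square([[10, 0], [0, 10]])  # signal separates the classes
useless = chi_square([[5, 5], [5, 5]])    # signal carries no information
```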
28
Signal Selection (e.g., CFS)

Instead of scoring individual signals, how about
scoring a group of signals as a whole?
CFS
A good group contains signals that are highly
correlated with the class, and yet uncorrelated
with each other
Homework: find a formula that captures the key
idea of CFS above
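One published answer is the CFS merit heuristic, k·r_cf / sqrt(k + k(k−1)·r_ff), where k is the group size, r_cf the average signal-class correlation and r_ff the average signal-signal correlation. The sketch below takes those two averages as given inputs; how they are estimated is left open, and the numbers are made up.

```python
from math import sqrt

def cfs_merit(k, avg_signal_class_corr, avg_signal_signal_corr):
    """Merit of a group of k signals under the CFS heuristic.

    The merit rises with correlation to the class and falls with
    redundancy among the signals in the group.
    """
    numerator = k * avg_signal_class_corr
    denominator = sqrt(k + k * (k - 1) * avg_signal_signal_corr)
    return numerator / denominator

independent = cfs_merit(3, 0.6, 0.0)  # uncorrelated signals
redundant = cfs_merit(3, 0.6, 1.0)    # fully redundant signals
```

With fully redundant signals the merit collapses to that of a single signal (0.6), while independent signals of the same quality score higher.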

29
Sample k-grams Selected
Position -3 (Kozak consensus)
in-frame upstream ATG (leaky scanning)
in-frame downstream TAA, TAG, TGA (stop codon)
in-frame downstream CTG, GAC, GAG, and GCC (codon bias)
30
Signal Integration

kNN
Given a test sample, find the k training samples
that are most similar to it. Let the majority
class win.
SVM
Given a group of training samples from two
classes, determine a separating plane that
maximises the margin between the two classes.
Naïve Bayes, ANN, C4.5, ...
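The kNN rule above can be sketched in a few lines. Squared Euclidean distance and the toy two-feature samples are assumptions made for illustration; any similarity measure could be substituted.

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """Predict the label of query by majority vote of its k nearest
    training samples. train is a list of (feature_vector, label)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    nearest = sorted(train, key=lambda item: sq_dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [
    ((0, 0), "non-TIS"), ((0, 1), "non-TIS"), ((1, 0), "non-TIS"),
    ((5, 5), "TIS"), ((5, 6), "TIS"), ((6, 5), "TIS"),
]
label = knn_predict(train, (5.2, 5.1))
```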

31
Results (on Pedersen & Nielsen's mRNA)
32
Acknowledgements

Roland Yap
Zeng Fanfan
A.G. Pedersen
H. Nielsen

33
Questions?
34
Common Mistakes
35
Self-fulfilling Oracle

Consider this scenario
Given classes C1 and C2 w/ explicit signals
Apply χ² to C1 and C2 to select signals s1, s2, s3
Run 3-fold cross-validation on C1 and C2 using s1,
s2, s3 and get accuracy of 90%
Is the accuracy really 90%?
What can be wrong with this?

36
Phil Long's Experiment

Let there be classes C1 and C2 w/ 100000 features
having randomly generated values
Use χ² to select 20 features
Run k-fold cross-validation on C1 and C2 w/ these 20
features
Expect 50% accuracy
Get 90% accuracy!
Lesson: select features within each fold, using only that fold's training data
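A small simulation in the spirit of this experiment. To keep it fast in plain Python it uses 2000 random binary features and 40 samples rather than 100000 features, a simple agreement score stands in for χ², and the classifier is a majority vote over the selected features. All of these are substitutions made for illustration, not the original setup.

```python
import random

def agreement(X, y, rows, j):
    """Number of the given rows where binary feature j equals the label."""
    return sum(1 for i in rows if X[i][j] == y[i])

def select(X, y, rows, n_keep):
    """Keep the features whose agreement is furthest from chance."""
    half = len(rows) / 2
    ranked = sorted(range(len(X[0])),
                    key=lambda j: abs(agreement(X, y, rows, j) - half),
                    reverse=True)
    return ranked[:n_keep]

def predict(X, y, train_rows, feats, i):
    """Majority vote; each feature is oriented using training rows only."""
    half = len(train_rows) / 2
    vote = 0
    for j in feats:
        sign = 1 if agreement(X, y, train_rows, j) > half else -1
        vote += sign * (1 if X[i][j] == 1 else -1)
    return 1 if vote > 0 else 0

def cv_accuracy(X, y, folds, n_keep, select_inside):
    all_rows = list(range(len(y)))
    global_feats = select(X, y, all_rows, n_keep)  # sees the test folds
    correct = 0
    for fold in folds:
        held_out = set(fold)
        train = [i for i in all_rows if i not in held_out]
        feats = select(X, y, train, n_keep) if select_inside else global_feats
        correct += sum(predict(X, y, train, feats, i) == y[i] for i in fold)
    return correct / len(y)

rng = random.Random(0)
n, m = 40, 2000
y = [i % 2 for i in range(n)]                                # balanced labels
X = [[rng.randint(0, 1) for _ in range(m)] for _ in range(n)]  # pure noise
folds = [list(range(f, n, 5)) for f in range(5)]

biased = cv_accuracy(X, y, folds, 20, select_inside=False)
honest = cv_accuracy(X, y, folds, 20, select_inside=True)
print(f"selection outside folds: {biased:.2f}, inside folds: {honest:.2f}")
```

Because the features are noise, the honest estimate should hover around chance, while selecting on the full data before cross-validation typically reports a much higher figure.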

37
Apples vs Oranges

Consider this scenario
Fanfan reported 89% accuracy on his TIS
prediction method
Hatzigeorgiou reported 94% accuracy on her TIS
prediction method
So Hatzigeorgiou's method is better
What is wrong with this conclusion?

38
Apples vs Oranges

Differences in datasets used
Fanfan's expt used Pedersen's dataset
Hatzigeorgiou's expt used her own dataset
Differences in counting
Fanfan's expt was on a per-ATG basis
Hatzigeorgiou's expt used the scanning rule and
thus was on a per-cDNA basis
When Fanfan ran the same dataset and counted the
same way as Hatzigeorgiou, he got 94% also!
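To see why the counting method matters, the sketch below scores the same made-up predictions two ways. It assumes the scanning rule means "the first ATG predicted positive in a cDNA is taken as its TIS"; the data layout and function names are hypothetical.

```python
def per_atg_accuracy(cdnas):
    """Fraction of individual ATGs classified correctly.

    cdnas is a list of cDNAs; each cDNA is a list of
    (predicted_is_tis, actual_is_tis) pairs in 5' to 3' order.
    """
    pairs = [pair for cdna in cdnas for pair in cdna]
    return sum(pred == actual for pred, actual in pairs) / len(pairs)

def per_cdna_accuracy(cdnas):
    """Fraction of cDNAs whose first predicted ATG is the true TIS."""
    correct = 0
    for cdna in cdnas:
        first_pred = next((i for i, (pred, _) in enumerate(cdna) if pred), None)
        true_tis = next((i for i, (_, actual) in enumerate(cdna) if actual), None)
        correct += first_pred == true_tis
    return correct / len(cdnas)

cdnas = [
    # Later false positives are ignored once the scan stops at the TIS.
    [(False, False), (True, True), (True, False), (True, False)],
    # The scan stops at a wrong upstream ATG.
    [(True, False), (False, True)],
]
atg_acc = per_atg_accuracy(cdnas)    # 2 of 6 ATGs correct
cdna_acc = per_cdna_accuracy(cdnas)  # 1 of 2 cDNAs correct
```

Identical predictions yield different accuracies under the two schemes, so figures counted differently are not directly comparable.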

39
Questions?
