Title: Bioinformatics, Data Integration and Machine Learning: A Thesis Proposal
1 Bioinformatics, Data Integration and Machine Learning: A Thesis Proposal
- Kaushik Sinha
- Supervisors: Prof. Gagan Agrawal and Prof. Mikhail Belkin
2 Roadmap
- Motivation
- Our Approach
- Current Work
- Learning Layouts of Flat-file Biological Datasets
- Exploratory Tools for Biological Data Analysis
- Proposed Work
- Deep Web Mining for Biological Data
- Semi-supervised Ranking
- Multiple-instance Learning
- Conclusion
3 Motivation
- Integration is hard
- Data explosion
- Data size and number of data sources
- New analysis tools
- Autonomous resources
- Heterogeneous data representation and various interfaces
- Frequent updates
- New trend: web and grid services
4 Motivation (contd.)
- In recent years, DNA microarray and other gene and protein assays have become essential tools for biologists
- The next step of biological enquiry is to find out
- What is known about these genes?
- How are these genes related to each other or to other genes identified in similar studies?
- However, the major difficulties are
- How do we extract key properties shared by a set of candidate genes?
- How do we generate reasonable hypotheses to explain them?
- How do we define and evaluate similarity between sets of genes?
5 Motivating Example
- Suppose that after a microarray experiment a biologist suspects that a small set of genes is related to a disease
- This can be confirmed by searching the existing literature
- One would expect related genes to appear together in the literature
- Due to the sheer volume of literature
- Searching is time consuming and error prone
- Some complications could arise as well
- Suppose genes A and C are related and both of them are weakly related to gene B
- In the literature, one would expect
- A, C appear together AND/OR
- A, B appear together
- B, C appear together
- How do we efficiently conclude that A, C are actually related?
6 Our Approach
- Use data mining / machine learning techniques to extract useful information from biological data
- Different forms of data
- Flat-file data
- Microarray data
- Online literature abstracts
- Develop different forms of tools
- Layout extractor
- Hypergraph mining
- Similarity measure among sets of genes
8 Learning Layout of a Flat-File
- In general, intractable
- Try to learn the layout, then have a domain expert verify it
- Key issue: what delimiters are being used?
9 Finding Delimiters
- Some knowledge from a domain expert is required (semi-automatic)
- Naïve approaches
- Frequency counting: counts frequently occurring single tokens (words separated by spaces)
- Sequence mining: counts frequently occurring sequences of tokens
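The two naïve approaches can be sketched as follows. This is a minimal illustration, not the implementation behind the slides: sequence mining is simplified here to counting contiguous token sequences of a fixed length, and the sample records and function names are hypothetical.

```python
from collections import Counter


def token_frequencies(lines):
    """Frequency counting: count single whitespace-separated tokens."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def sequence_frequencies(lines, length):
    """Simplified sequence mining: count contiguous token sequences of `length`."""
    counts = Counter()
    for line in lines:
        tokens = line.split()
        for start in range(len(tokens) - length + 1):
            counts[tuple(tokens[start:start + length])] += 1
    return counts


# Hypothetical flat-file lines, made up for illustration.
records = [
    "ID P12345",
    "DE Some protein",
    "ID Q67890",
    "DE Another protein",
]

print(token_frequencies(records).most_common(3))
print(sequence_frequencies(records, 2).most_common(2))
```

Both counters ignore where in the line a token occurs, which is the limitation the positional weight is meant to address.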
10 Assumptions
- Biological datasets are written for humans to read
- It is very unlikely that delimiters will be scattered across different places in a line
- The position of the possible delimiters might provide useful information
- A combination of positional and frequency information might be a better choice
11 Positional Weight
- Let P be the set of different positions in a line where a token can appear
- For each position i ∈ P, tot_seq_i^j represents the total number of token sequences of length j starting at position i
- For each position i ∈ P, tot_unique_seq_i^j represents the total number of unique token sequences of length j starting at position i
- For any tuple (i,j), p_ratio(i,j) is defined by an equation on the slide (not reproduced here)
- p_ratio(i,j) can be log-normalized to get the positional weight p_wt(i,j), with the property p_wt(i,j) ∈ (0,1)
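The two counts that feed the positional weight can be computed as in the sketch below. It stops at the counts because the defining equation for p_ratio is not available in this text. Positions are 0-based token indices here, which is an assumption, and the sample records and the function name are hypothetical.

```python
from collections import defaultdict


def positional_counts(lines, length):
    """For each start position i, return (total, unique) counts of the
    token sequences of `length` tokens that start at position i."""
    sequences_at = defaultdict(list)
    for line in lines:
        tokens = line.split()
        for i in range(len(tokens) - length + 1):
            sequences_at[i].append(tuple(tokens[i:i + length]))
    return {i: (len(seqs), len(set(seqs))) for i, seqs in sequences_at.items()}


# Hypothetical flat-file lines, made up for illustration.
records = ["ID P12345", "DE Some protein", "ID Q67890", "DE Another protein"]
counts = positional_counts(records, 1)
for position, (total, unique) in sorted(counts.items()):
    print(position, total, unique)
```

In this toy input, position 0 has few unique tokens relative to the total, which is the kind of signal one would expect at a delimiter position.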
12 Delimiter score (d_score)
- The frequency weight for any token sequence s_i^j with length j and starting at position i, f_wt(s_i^j), is obtained by log-normalizing the frequency f(s_i^j)
- Obviously, f_wt(s_i^j) ∈ (0,1)
- Positional and frequency weights can now be combined to get d_score as follows:
- d_score(s_i^j) = α · p_wt(i,j) + (1 - α) · f_wt(s_i^j)
- where α ∈ (0,1)
- Thus d_score has the following two properties:
- d_score(s_i^j) ∈ (0,1)
- d_score(s_i^j) > d_score(s_k^j) implies s_i^j is more likely to be a delimiter than s_k^j
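A small sketch of the combination step, reading d_score as a convex combination of the positional and frequency weights with a mixing weight in (0,1), called alpha here. The exact log normalization is not given in the text, so `log_normalize` below is one assumed choice that keeps values strictly inside (0,1); the candidate tokens and their weights are hypothetical.

```python
import math


def log_normalize(count, max_count):
    """Assumed normalization: log(1 + count) / log(2 + max_count).
    For 1 <= count <= max_count the result lies strictly in (0, 1)."""
    return math.log(1 + count) / math.log(2 + max_count)


def d_score(p_wt, f_wt, alpha):
    """Convex combination of positional weight and frequency weight."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie in (0, 1)")
    return alpha * p_wt + (1.0 - alpha) * f_wt


# Hypothetical candidates: token -> (positional weight, raw frequency).
candidates = {"ID": (0.9, 40), "protein": (0.2, 35)}
max_frequency = max(freq for _, freq in candidates.values())

scores = {
    token: d_score(p_wt, log_normalize(freq, max_frequency), alpha=0.5)
    for token, (p_wt, freq) in candidates.items()
}
for token, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{token}: {score:.3f}")
```

Raising alpha gives more influence to position and lowering it gives more influence to frequency, so sweeping alpha changes which candidates rank highest.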
13 Generating layout descriptor
- Once the delimiters are identified, an NFA can be built by scanning the whole database, where the delimiters are the states of the NFA
- This NFA can be used to generate a layout descriptor, since it naturally represents optional and repeating states
- The figure (not reproduced here) shows an NFA where A, B, C, D, and E are delimiters, with B being an optional delimiter and C, D being repeating delimiters
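The idea of deriving optional and repeating states from the scanned records can be sketched as follows. Each record is reduced to the ordered list of delimiters it contains. The two sample records are hypothetical and were chosen to match the description of the figure (B optional, C and D repeating). The function names are illustrative, and this is a simplification rather than the NFA construction used in the work.

```python
from collections import defaultdict


def build_transitions(records):
    """Collect delimiter-to-delimiter transitions seen across all records."""
    transitions = defaultdict(set)
    for delimiters in records:
        for current, following in zip(delimiters, delimiters[1:]):
            transitions[current].add(following)
    return dict(transitions)


def optional_delimiters(records):
    """Delimiters that are absent from at least one record."""
    seen = set().union(*map(set, records))
    return {d for d in seen if any(d not in record for record in records)}


def repeating_delimiters(records):
    """Delimiters that occur more than once within some record."""
    return {d for record in records for d in record if record.count(d) > 1}


# Hypothetical delimiter sequences for two records.
records = [
    ["A", "B", "C", "D", "C", "D", "E"],
    ["A", "C", "D", "E"],
]

print(build_transitions(records))
print(optional_delimiters(records))
print(repeating_delimiters(records))
```

The edge from D back to C is what marks the C, D pair as repeating, and the edge from A directly to C is what lets B be skipped.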
14 Results
- By suitably varying α, a tight superset of the possible delimiters is found
- A domain expert can then help to identify the true delimiters
- Results from 3 different flat-file datasets were shown in a table (not reproduced here)
15 Comparison with naïve approaches
- The d_score-based approach does a better job than the naïve approaches
- The improvement was shown in a table (not reproduced here)
16 Realistic Situation
- Identifying the complete list of correct delimiters is difficult
- Most likely we will end up with an incomplete list of delimiters
- The delimiters that do not appear in every data record (optional delimiters) are the ones most likely to be missed
17 Identifying Optional Delimiters
- Given an incomplete list of delimiters, how can we identify optional delimiters, if any?
- Build an NFA based on the given incomplete information
- Perform clustering to identify possible crucial delimiters
- Perform contrast analysis
18 Crucial delimiter
- A delimiter is considered crucial if missing delimiters appear immediately after it
- The goal is to create two clusters:
- One containing delimiters that are not crucial
- The other containing crucial delimiters
19 Identifying crucial delimiters: a few definitions
- Succ(X): set of delimiters that can immediately follow X
- Dist_App: number of groups of occurrences of X, based on the number of text lines between X and the immediately next delimiter
- Info_Tuple (nXi, fXi, tXi): information for each Dist_App
- Info_Tuple_List LX: for any X, the list of all possible Info_Tuples
20 Metric for clustering
- rXf is likely to be low if an optional delimiter appears immediately after X, and high otherwise
- Choose a suitable cut-off value rc and assign delimiters to groups as follows:
- If rXf < rc, assign X to the group of possible crucial delimiters
- Otherwise, assign X to the group of non-crucial delimiters
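The cut-off rule can be sketched as below. The formula for the metric itself is not available in this text, so the sketch takes already-computed metric values as input. The values, the cut-off, and the function name are hypothetical.

```python
def split_by_cutoff(r_values, r_cutoff):
    """Partition delimiters into (possibly crucial, non-crucial) by a cut-off."""
    possibly_crucial = {name for name, r in r_values.items() if r < r_cutoff}
    non_crucial = set(r_values) - possibly_crucial
    return possibly_crucial, non_crucial


# Hypothetical metric values for four delimiters; not taken from the slides.
r_values = {"A": 0.9, "B": 0.2, "C": 0.85, "D": 0.35}
crucial, non_crucial = split_by_cutoff(r_values, r_cutoff=0.5)
print(sorted(crucial), sorted(non_crucial))
```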
21 Observations and Facts
- Missing optional delimiters can appear ONLY immediately after crucial delimiters
- Non-crucial delimiters can be pruned away
- Consider two Info_Tuples (nX1, fX1, tX1) and (nX2, fX2, tX2) in LX
- If a missing delimiter appears immediately after the occurrences corresponding to the first tuple but not the second one:
- nX1 > nX2
- The missing delimiter will appear in tX1 but not in tX2
22 A hypothetical example illustrating Contrast Analysis
- Suppose X is a crucial delimiter having 2 Info_Tuples, L1 and L2, as follows:
- L1 = (50, 20, l1.txt)
- L2 = (20, 12, l2.txt)
- Sequence mining on l1.txt and l2.txt yields two sets of frequently occurring sequences, S1 and S2, as follows:
- S1 = {f1, f5, f6, f8, f13, f21}
- S2 = {f1, f4, f6, f7, f8, f10, f13, f21}
- Since f5 ∈ S1 but f5 ∉ S2, f5 is a possible missing delimiter
- f5 is a missing delimiter only if it has a high d_score or is verified by a domain expert as a valid delimiter
23 Contrast Analysis
- For any i, j, if nXi > nXj, look for frequently occurring sequences in tXi and tXj; call them fsXi and fsXj respectively
- If there exists a frequent sequence fs such that fs ∈ fsXi but fs ∉ fsXj, then fs is quite likely to be a possible delimiter
- If fs has a fairly high d_score or is identified by a domain expert as a valid delimiter, add it to the incomplete list as a newly found delimiter
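Contrast analysis amounts to a set difference followed by a filter. The sketch below reuses the sets S1 and S2 from the hypothetical example on the previous slide. The d_score value, the acceptance threshold, and the function names are made up for illustration, and the domain-expert check is not modeled.

```python
def contrast_candidates(frequent_in_longer, frequent_in_shorter):
    """Sequences frequent in the longer text block but not in the shorter one."""
    return set(frequent_in_longer) - set(frequent_in_shorter)


def accept(candidates, d_scores, min_score):
    """Keep candidates whose d_score reaches the threshold."""
    return {c for c in candidates if d_scores.get(c, 0.0) >= min_score}


S1 = {"f1", "f5", "f6", "f8", "f13", "f21"}
S2 = {"f1", "f4", "f6", "f7", "f8", "f10", "f13", "f21"}

candidates = contrast_candidates(S1, S2)
d_scores = {"f5": 0.8}  # hypothetical score
accepted = accept(candidates, d_scores, min_score=0.7)
print(candidates, accepted)
```

The difference is taken in one direction only: sequences such as f4 that appear only in the shorter block are not candidates.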
24 Generalized Contrast Analysis
- In case of more than two Info_Tuples, compute the mean of all nXi values
- Form a group by appending text from all Info_Tuples where nXi exceeds the mean
- Form another group by appending text from all Info_Tuples where nXi does not exceed the mean
- Perform contrast analysis among all such possible groups
25 Another example illustrating Generalized Contrast Analysis
- Suppose X is a crucial delimiter having 3 Info_Tuples, L1, L2, L3, as follows:
- L1 = (50, 20, l1.txt)
- L2 = (20, 12, l2.txt)
- L3 = (15, 10, l3.txt)
- Mean number of lines: (50 + 20 + 15) / 3 ≈ 28.3
- Append l2.txt and l3.txt; call the result t2.txt
- Sequence mining on l1.txt and t2.txt yields two sets of frequently occurring sequences, S1 and S2, as follows:
- S1 = {f1, f5, f6, f8, f13, f21}
- S2 = {f1, f4, f6, f7, f8, f10, f13, f21}
- Since f5 ∈ S1 but f5 ∉ S2, f5 is a possible missing delimiter
- f5 is a missing delimiter only if it has a high d_score or is verified by a domain expert as a valid delimiter
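The grouping step of this example can be sketched as follows, using the three Info_Tuples from the slide. Treating a tuple whose line count equals the mean as part of the lower group is an assumption, since the exact conditions are not recoverable from the text. The function name is illustrative.

```python
from statistics import mean


def split_by_mean(info_tuples):
    """Split Info_Tuples (n, f, text_name) by whether n exceeds the mean of n."""
    average = mean(n for n, _, _ in info_tuples)
    above = [text for n, _, text in info_tuples if n > average]
    rest = [text for n, _, text in info_tuples if n <= average]
    return average, above, rest


info_tuples = [(50, 20, "l1.txt"), (20, 12, "l2.txt"), (15, 10, "l3.txt")]
average, above, rest = split_by_mean(info_tuples)
print(round(average, 2), above, rest)
```

The texts named in `rest` would then be appended into one block (t2.txt in the example) before running the same contrast analysis as in the two-tuple case.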
26 Overall Algorithms
27 Results: Optional delimiters
28 Results: Non-optional missing delimiters
- Even though designed for finding optional delimiters, our algorithm works, in some cases, for missing non-optional delimiters too
- If a missing non-optional delimiter appears in exactly the same location in each record, then our algorithm fails
- If a non-optional delimiter has a backward edge coming from a delimiter that appears later in a topologically sorted NFA, then our algorithm works
30 Hypergraph Mining
- Basic motivation
- To find useful transitive relations (hypergraphs) among genes
- Example (gene-disease relationship)
- Gene A is related to gene B
- Gene B is related to gene C
- Is gene A related to gene C?
- Gene source
- Microarray experiments
- Information source
- Online literature abstracts
31 Formal Problem Definition
- Given
- A dictionary KT of keywords
- A dictionary KM of user-provided keywords (KT ⊇ KM)
- A collection of literature abstracts, each represented as a set of keywords
- Task
- To find hyperedges exceeding a user-defined threshold, each of which involves a set of keywords from KM that are potentially connected by another set of linking words from KT - KM
32 Modeling
- Purpose
- To use an approach similar to frequent itemset mining
- Define
- total weight = support + cross support
- Support: the set of keywords appears together in one document
- Cross support: the set of keywords can be partitioned so that each part appears in a different document
- Issues
- Since the downclosure property does not hold for total weight, a modified downclosure property can be defined
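The support term can be illustrated with a short sketch. Support is taken here to be the number of abstracts that contain every keyword in the set, which is an assumption about how the count is defined. Cross support is not implemented because its exact definition is not given in the text. The abstracts and gene names are hypothetical.

```python
def support(keywords, documents):
    """Number of documents (sets of keywords) that contain every given keyword."""
    wanted = set(keywords)
    return sum(1 for document in documents if wanted <= document)


# Hypothetical abstracts, each reduced to a set of keywords.
abstracts = [
    {"geneA", "geneB", "kinase"},
    {"geneB", "geneC"},
    {"geneA", "geneB"},
]

print(support({"geneB"}, abstracts))
print(support({"geneA", "geneB"}, abstracts))
print(support({"geneA", "geneC"}, abstracts))
```

In this toy data geneA and geneC never share an abstract, yet each co-occurs with geneB. Support alone cannot see that link, which is the situation cross support is meant to capture.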
33 Idea
- Support satisfies the downclosure property
- Let X be a set and Ω be its power set. A function f: Ω → R satisfies the downclosure property if for all A, B ∈ Ω with A ⊇ B, f(B) ≥ f(A)
- Cross support can be designed to stay below a particular value, i.e., it is bounded
- Form a function h as the sum of two functions, h = f + g
- f satisfies the downclosure property
- g is bounded
- h satisfies a modified downclosure property
- For any θ > 0, if h(K_n) ≥ θ then f(K_{n-1}) ≥ max{0, θ - sup(g)}
- This property can be used to devise an efficient algorithm
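One way to see why a property of this shape holds is the short derivation below. It rests on several assumptions that the slide text does not confirm: θ denotes the user-defined threshold, K_{n-1} is a subset of K_n, the garbled relations are read as "at least", and f (support) is non-negative.

```latex
\begin{align*}
h(K_n) \ge \theta
  &\implies f(K_n) + g(K_n) \ge \theta
  && \text{since } h = f + g \\
  &\implies f(K_n) \ge \theta - \sup(g)
  && \text{since } g(K_n) \le \sup(g) \\
  &\implies f(K_{n-1}) \ge \theta - \sup(g)
  && \text{since } f(K_{n-1}) \ge f(K_n) \text{ by downclosure} \\
  &\implies f(K_{n-1}) \ge \max\{0,\ \theta - \sup(g)\}
  && \text{since } f \ge 0
\end{align*}
```

Read this way, any keyword set whose subsets fall below the relaxed threshold on f can be pruned without evaluating h, much as candidates are pruned in frequent itemset mining.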
34 Results
35 Similarity Measure among sets of genes
- Each file containing gene names can be considered a discrete random variable (DRV)
- Each such DRV can take several values (gene names)
- For two such files X, Y and any pair (x, y) with x ∈ X and y ∈ Y, p(x, y) can be computed from online abstracts based on co-occurrence
- Defining Z = g(X, Y), Z is a random variable
- The expectation of Z can be used as a similarity measure
- Different choices of g give rise to different similarity measures
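The expectation-based measure can be sketched as follows. How p(x, y) is derived from co-occurrence is not specified in the text, so normalizing raw co-occurrence counts is an assumption. The counts and the choice of g (an indicator that the two gene names match) are hypothetical.

```python
def joint_from_counts(cooccurrence_counts):
    """Assumed estimate of p(x, y): co-occurrence counts normalized to sum to 1."""
    total = sum(cooccurrence_counts.values())
    return {pair: count / total for pair, count in cooccurrence_counts.items()}


def expected_similarity(joint, g):
    """E[Z] for Z = g(X, Y) under the joint distribution `joint`."""
    return sum(p * g(x, y) for (x, y), p in joint.items())


def same_gene(x, y):
    """Hypothetical choice of g: 1 if the two gene names match, else 0."""
    return 1.0 if x == y else 0.0


# Hypothetical co-occurrence counts between genes from two files.
counts = {("geneA", "geneA"): 2, ("geneA", "geneB"): 1, ("geneB", "geneA"): 1}
joint = joint_from_counts(counts)
print(expected_similarity(joint, same_gene))
```

Swapping `same_gene` for another function of the pair yields a different similarity measure from the same joint distribution.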
37 Query Planning for Deep Web Mining
- A huge amount of online biological information is available in the form of the deep web
- An online query form needs to be filled out
- The required information is obtained by filling out many such forms from different websites
- There might be dependencies among these forms
- Requires redundancy elimination
39 Semi-supervised Ranking
- Ranking
- Given a training set of examples with labels / pairwise relationships
- The task is to rank an unseen test set, i.e., to find a permutation such that relevant examples are ranked higher than irrelevant ones
- This corresponds to learning a ranking function
- Semi-supervised ranking
- Incorporating unlabeled examples to learn the ranking function
- Out-of-sample extension
40 Potential Application
- Following a microarray experiment, it might be possible to guess whether gene A is more important than gene B among the genes involved in the experiment
- However, determining all possible order relationships is time consuming and error prone
- Thus, from a small set of order relationships, and using other genes from the experiment as unlabeled data, a semi-supervised ranking function can be learned
42 Multiple Instance Learning
- Instead of instance-label pairs (x, y), bag-label pairs (B, y) are provided as training data
- A bag contains multiple instances
- A bag label is negative if every instance in the bag has a negative label
- A bag label is positive if at least one instance in the bag has a positive label
- Given an unseen bag, the task is to predict its label
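The bag-labeling rule can be written down directly. The sketch below only encodes that rule and applies it to the outputs of an instance-level classifier. It does not learn anything, and the threshold classifier and the numeric instances are hypothetical.

```python
def bag_label(instance_labels):
    """Positive (True) if at least one instance is positive, else negative (False)."""
    return any(instance_labels)


def predict_bag(bag, instance_classifier):
    """Label an unseen bag by applying the same rule to predicted instance labels."""
    return any(instance_classifier(instance) for instance in bag)


def is_positive(instance):
    """Hypothetical instance-level classifier: a fixed threshold on one number."""
    return instance > 0.5


print(bag_label([False, True, False]))
print(bag_label([False, False]))
print(predict_bag([0.1, 0.7], is_positive))
```

The difficulty in multiple-instance learning is that only bag labels are observed during training, so the instance-level classifier has to be inferred rather than given as it is here.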
43 Potential Application
- Following a microarray experiment, it might be possible to form bags of genes with appropriate labels
- From different biological labs doing similar experiments, many such bags can be obtained for use as training data
- Before designing a new microarray experiment, a gene set can be selected based on multiple-instance learning
44 Summary
- Use of data mining / machine learning techniques to extract information from biological data
- Work done
- Learning layouts of flat-file biological datasets
- Hypergraph mining
- Similarity measure among sets of genes
- Proposed work
- Study and application of machine learning techniques