Bioinformatics, Data Integration and Machine Learning: A Thesis Proposal

Slides: 45

Provided by: jeff142

1
Bioinformatics, Data Integration and Machine
Learning: A Thesis Proposal

Kaushik Sinha
Supervisors: Prof. Gagan Agrawal and Prof.
Mikhail Belkin

2
Roadmap

Motivation
Our Approach
Current Work
Learning Layouts of Flat-file Biological Datasets
Exploratory Tools for Biological Data Analysis
Proposed Work
Deep Web Mining for Biological Data
Semi-supervised Ranking
Multiple-instance Learning
Conclusion

3
Motivation

Integration is hard
Data explosion
Data size, number of data sources
New analysis tools
Autonomous resources
Heterogeneous data representation, various
interfaces
Frequent updates
New trend: web and grid services

4
Motivation (contd.)

In recent years, DNA microarrays and other gene and
protein assays have become essential tools for
biologists
The next step of biological enquiry is to find out
What is known about these genes?
How are these genes related to each other or
to other genes identified in similar studies?
However, the major difficulties are
How do we extract key properties shared by a
set of candidate genes?
How do we generate reasonable hypotheses to
explain them?
How do we define and evaluate similarity between
sets of genes?

5
Motivating Example

Suppose that after a microarray experiment a
biologist suspects that a small set of genes is
related to a disease
This can be confirmed by searching the existing
literature
One would expect related genes to appear together
in the literature
Due to the sheer volume of literature
Searching is time consuming and error prone
Some complications could arise as well
Suppose genes A and C are related and
both of them are weakly related to gene B
In the literature, one would expect
A, C to appear together, and/or
A, B to appear together
B, C to appear together
How do we efficiently conclude that A and C are
actually related?

6
Our Approach

Using data mining / machine learning techniques
to extract useful information from biological
data
Different forms of data
Flat-file data
Microarray data
Online literature abstracts
Develop different forms of tools
Layout extractor
Hypergraph mining
Similarity measure among sets of genes

7
Roadmap

Motivation
Our Approach
Current Work
Learning Layouts of Flat-file Biological Datasets
Exploratory Tools for Biological Data Analysis
Proposed Work
Deep Web Mining for Biological Data
Semi-supervised Ranking
Multiple-instance Learning
Conclusion

8
Learning Layout of a Flat-File

In general intractable
Try and learn the layout, have a domain expert
verify
Key issue: what delimiters are being used?

9
Finding Delimiters

Some knowledge from domain expert is required
(Semi-automatic)
Naïve approaches
Frequency Counting
Counts frequently occurring single tokens (words
separated by spaces)
Sequence Mining
Counts frequently occurring sequences of tokens
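
The two naïve approaches can be sketched roughly as follows. This is only an illustration: the function names and the sample lines are hypothetical, and sequence mining is simplified here to counting contiguous token sequences of a fixed length within each line.

```python
from collections import Counter

def token_frequencies(lines):
    """Count how often each single whitespace-separated token occurs."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

def sequence_frequencies(lines, length):
    """Count contiguous token sequences of the given length within each line."""
    counts = Counter()
    for line in lines:
        tokens = line.split()
        for start in range(len(tokens) - length + 1):
            counts[tuple(tokens[start:start + length])] += 1
    return counts

# Hypothetical three-line flat-file fragment (not taken from the slides).
sample = ["ID P12345", "DE Example protein", "ID Q67890"]
single = token_frequencies(sample)
pairs = sequence_frequencies(sample, 2)
```

Neither count uses where in the line a token occurs, which is the information the positional weight is meant to add.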

10
Assumptions

Biological datasets are written for humans to
read
It is very unlikely that delimiters will be
scattered all around, in different places in a
line
Position of the possible delimiters might provide
useful information
Combination of positional and frequency
information might be a better choice

11
Positional Weight

Let P be the set of positions in a line where
a token can appear
For each position i ∈ P, tot_seq_j^i represents the
total # of token sequences of length j starting
at position i
For each position i ∈ P, tot_unique_seq_j^i
represents the total # of unique token sequences of
length j starting at position i
For any tuple (i,j), p_ratio(i,j) is defined as
shown above [equation image not preserved in this transcript]
p_ratio(i,j) can be log normalized to get the
positional weight, p_wt(i,j), with the property
p_wt(i,j) ∈ (0,1)
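
The two positional counts defined on this slide can be computed along the following lines. This is a sketch under assumptions: positions are taken to be 0-based token indices, the function name and sample lines are hypothetical, and p_ratio itself is left out because its formula is not part of the transcript.

```python
from collections import defaultdict

def positional_counts(lines, max_len):
    """Return (tot_seq, tot_unique_seq), each keyed by (i, j):
    i = 0-based token position in the line, j = sequence length."""
    seen = defaultdict(list)
    for line in lines:
        tokens = line.split()
        for i in range(len(tokens)):
            for j in range(1, max_len + 1):
                if i + j <= len(tokens):
                    seen[(i, j)].append(tuple(tokens[i:i + j]))
    tot_seq = {key: len(seqs) for key, seqs in seen.items()}
    tot_unique_seq = {key: len(set(seqs)) for key, seqs in seen.items()}
    return tot_seq, tot_unique_seq

# Hypothetical three-line flat-file fragment (not taken from the slides).
sample = ["ID P12345", "DE Example protein", "ID Q67890"]
tot_seq, tot_unique_seq = positional_counts(sample, 2)
```

In this toy input, position 0 holds only two distinct tokens across three lines, while position 1 holds three distinct tokens, which matches the intuition that delimiters repeat at a fixed position.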

12
Delimiter score (d_score)

Frequency weight for any token sequence s_j^i with
length j and starting at position i, f_wt(s_j^i),
is obtained by log normalizing the frequency f(s_j^i)
As with p_wt, f_wt(s_j^i) ∈ (0,1)
Positional and frequency weights can now be
combined to get d_score as follows,
d_score(s_j^i) = α · p_wt(i,j) + (1 − α) · f_wt(s_j^i)
where α ∈ (0,1)
Thus d_score has the following two properties,
d_score(s_j^i) ∈ (0,1)
d_score(s_j^i) > d_score(s_j^k) implies s_j^i is more
likely to be a delimiter than s_j^k
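
A minimal sketch of the weighted combination follows. The slides do not give the exact log normalization, so `log_normalize` below is an assumed form chosen only because it stays strictly inside (0,1); the input numbers are hypothetical.

```python
import math

def log_normalize(value, max_value):
    """Assumed normalization: log(1 + value) / log(2 + max_value).
    Lies strictly in (0, 1) for 1 <= value <= max_value."""
    return math.log(1 + value) / math.log(2 + max_value)

def d_score(p_wt, f_wt, alpha):
    """Convex combination of positional weight and frequency weight."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie in (0, 1)")
    return alpha * p_wt + (1.0 - alpha) * f_wt

# Hypothetical values: frequency 40 out of a maximum of 50, positional weight 0.8.
f_wt = log_normalize(40, 50)
score = d_score(p_wt=0.8, f_wt=f_wt, alpha=0.5)
```

Because both weights lie in (0,1) and the coefficients sum to 1, the score also lies in (0,1); varying `alpha` shifts the emphasis between position and frequency.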

13
Generating layout descriptor

Once the delimiters are identified, an NFA can be
built by scanning the whole database, where the
delimiters are the states of the NFA
This NFA can be used to generate a layout
descriptor since it naturally represents optional
and repeating states
The following figure shows an NFA, where A, B,
C, D, and E are delimiters, with B being an
optional delimiter and C D being repeating
delimiters
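
One way to picture the construction is to record, for every record in the database, which delimiter follows which. The sketch below is an assumption about how such a scan might look, not the procedure from the thesis; the two sample records are hypothetical and merely mirror the A to E example, with B optional and C D repeating.

```python
from collections import defaultdict

def build_transitions(records):
    """Map each delimiter to the set of delimiters observed right after it."""
    transitions = defaultdict(set)
    for record in records:
        for current, following in zip(record, record[1:]):
            transitions[current].add(following)
    return dict(transitions)

def optional_delimiters(records):
    """Delimiters that are absent from at least one record."""
    all_delims = set().union(*records)
    return {d for d in all_delims if any(d not in r for r in records)}

def repeating_delimiters(records):
    """Delimiters that occur more than once within some record."""
    return {d for r in records for d in r if r.count(d) > 1}

# Each record is the ordered list of delimiters found in it (hypothetical data).
records = [list("ABCDCDE"), list("ACDE")]
transitions = build_transitions(records)
```

The edge from A directly to C reflects that B can be skipped, and the edge from D back to C reflects the repetition.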

14
Results

By suitably varying α, a tight superset of
possible delimiters is found
A domain expert can then help to identify the
true delimiters
Results from 3 different flat-file datasets are
as follows

15
Comparison with naïve approaches

d_score based approach definitely does a better
job as compared to the naïve approaches
The following table clearly shows the improvement

16
Realistic Situation

The task of identifying a complete list of correct
delimiters is difficult
Most likely we will end up with an
incomplete list of delimiters
The delimiters that do not appear in every
data record (optional delimiters) are the ones most
likely to be missed

17
Identifying Optional Delimiters

Given an incomplete list of delimiters, how can we
identify optional delimiters, if any?
Build an NFA based on the given incomplete information
Perform clustering to identify possible crucial
delimiters
Perform contrast analysis

18
Crucial delimiter

A delimiter is considered crucial if missing
delimiters can appear immediately after
it
The goal is to create two clusters,
One containing delimiters that are not crucial
The other containing crucial delimiters

19
Identifying crucial delimiters: A few definitions

Succ(X): Set of delimiters that can immediately
follow X
Dist_App: # of groups of occurrences of X based
on # of text lines between X and the immediately next
delimiter
Info_Tuple (n_Xi, f_Xi, t_Xi): Information for each
Dist_App
Info_Tuple_List L_X: For any X, the list of all
possible Info_Tuples

20
Metric for clustering

r_X^f is likely to be low if an optional delimiter
appears immediately after X, and high otherwise
Choose a suitable cut-off value r_c and assign
delimiters to different groups as follows,
If r_X^f < r_c, assign X to the group containing
possible crucial delimiters
Else assign X to the group containing non-crucial
delimiters
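
The grouping rule amounts to a threshold split. In the sketch below the metric values are simply supplied as input, since the definition of the metric is not part of this transcript; the delimiter names, values, and cut-off are hypothetical.

```python
def split_by_cutoff(r_values, r_c):
    """Partition delimiters into (possibly crucial, non-crucial) by cut-off r_c."""
    crucial = {x for x, r in r_values.items() if r < r_c}
    non_crucial = set(r_values) - crucial
    return crucial, non_crucial

# Hypothetical metric value per delimiter.
r_values = {"ID": 0.95, "DE": 0.40, "SQ": 0.90}
crucial, non_crucial = split_by_cutoff(r_values, r_c=0.5)
```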

21
Observations and Facts

Missing optional delimiters can appear
immediately after crucial delimiters ONLY
Non-crucial delimiters can be pruned away
Consider two Info_Tuples (n_X1, f_X1, t_X1) and
(n_X2, f_X2, t_X2) in L_X
If a missing delimiter appears immediately after
the appearance corresponding to the first tuple
but not the second one,
n_X1 > n_X2
The missing delimiter will appear in t_X1 but not in
t_X2

22
A hypothetical example illustrating Contrast
Analysis

Suppose X is a crucial delimiter having 2
Info_Tuples, L1 and L2, as follows,
L1 = (50, 20, l1.txt)
L2 = (20, 12, l2.txt)
Sequence mining on l1.txt and l2.txt yields two
sets of frequently occurring sequences, S1 and
S2, as follows,
S1 = {f1, f5, f6, f8, f13, f21}
S2 = {f1, f4, f6, f7, f8, f10, f13, f21}
Since f5 ∈ S1 but f5 ∉ S2, f5 is a possible
missing delimiter
f5 is a missing delimiter only if it has a high
d_score or is verified by a domain expert as a
valid delimiter
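
The contrast step in this example is a set difference followed by a confirmation check. The sketch below reuses the sets S1 and S2 from the slide; the d_score value and the threshold are hypothetical, and expert verification is not modeled.

```python
def contrast(frequent_1, frequent_2):
    """Sequences frequent in the first text but absent from the second."""
    return set(frequent_1) - set(frequent_2)

def confirm(candidates, d_scores, threshold):
    """Keep only candidates whose d_score reaches the (assumed) threshold."""
    return {c for c in candidates if d_scores.get(c, 0.0) >= threshold}

S1 = {"f1", "f5", "f6", "f8", "f13", "f21"}
S2 = {"f1", "f4", "f6", "f7", "f8", "f10", "f13", "f21"}
candidates = contrast(S1, S2)

# Hypothetical d_score for the candidate and a hypothetical threshold.
confirmed = confirm(candidates, {"f5": 0.82}, threshold=0.7)
```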

23
Contrast Analysis

For any i,j, if n_Xi > n_Xj, look for frequently
occurring sequences in t_Xi and t_Xj, call them
fs_Xi and fs_Xj respectively
If there exists a frequent sequence fs such
that fs ∈ fs_Xi but fs ∉ fs_Xj, then fs is quite
likely to be a possible delimiter
If fs has a fairly high d_score or is identified by
a domain expert as a valid delimiter, add it to the
incomplete list as a newly found delimiter

24
Generalized Contrast Analysis

In case of more than two Info_Tuples, compute the
mean of all n_Xi values
Form a group by appending text from all
Info_Tuples where n_Xi is above the mean
Form another group by appending text from all
Info_Tuples where n_Xi is below the mean
Perform contrast analysis among all such possible
groups

25
Another example illustrating Generalized Contrast
Analysis

Suppose X is a crucial delimiter having 3
Info_Tuples, L1, L2, L3, as follows,
L1 = (50, 20, l1.txt)
L2 = (20, 12, l2.txt)
L3 = (15, 10, l3.txt)
Mean number of lines = (50 + 20 + 15) / 3 ≈ 28.3
Append l2.txt and l3.txt, call it t2.txt
Sequence mining on l1.txt and t2.txt yields two
sets of frequently occurring sequences, S1 and
S2, as follows,
S1 = {f1, f5, f6, f8, f13, f21}
S2 = {f1, f4, f6, f7, f8, f10, f13, f21}
Since f5 ∈ S1 but f5 ∉ S2, f5 is a possible
missing delimiter
f5 is a missing delimiter only if it has a high
d_score or is verified by a domain expert as a
valid delimiter
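
The grouping in this example can be sketched as a split around the mean of the first tuple component. This assumes, consistently with the example, that tuples above the mean form one group and the remaining tuples the other; the text strings stand in for the contents of the three files and are hypothetical.

```python
def split_by_mean(info_tuples):
    """Split (n, f, text) tuples into those with n above the mean and the rest."""
    mean_n = sum(n for n, _, _ in info_tuples) / len(info_tuples)
    above = [t for t in info_tuples if t[0] > mean_n]
    rest = [t for t in info_tuples if t[0] <= mean_n]
    return mean_n, above, rest

def merge_text(group):
    """Append the text of every tuple in a group, one after another."""
    return "\n".join(text for _, _, text in group)

tuples = [(50, 20, "text of l1"), (20, 12, "text of l2"), (15, 10, "text of l3")]
mean_n, above, rest = split_by_mean(tuples)
t1_text = merge_text(above)   # plays the role of l1.txt
t2_text = merge_text(rest)    # plays the role of t2.txt
```

Sequence mining and contrast analysis would then run on the two merged texts exactly as in the two-tuple case.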

26
Overall Algorithms
27
Results: Optional delimiters

Pruning

28
Results: Non-optional missing delimiters

Even though it was designed for finding optional
delimiters, our algorithm works, in some cases,
for missing non-optional delimiters too
If a missing non-optional delimiter appears
in exactly the same location in each record, then
our algorithm fails
If a non-optional delimiter has a backward edge
coming from a delimiter that appears later in a
topologically sorted NFA, then our algorithm works

29
Roadmap

Motivation
Our Approach
Current Work
Learning Layouts of Flat-file Biological Datasets
Exploratory Tools for Biological Data Analysis
Proposed Work
Deep Web Mining for Biological Data
Semi-supervised Ranking
Multiple-instance Learning
Conclusion

30
Hypergraph Mining

Basic Motivation
To find useful transitive relations
(hypergraphs) among genes
Example (Gene-Disease Relationship)
Gene A is related to Gene B
Gene B is related to Gene C
Is Gene A related to Gene C?
Gene Source
Microarray Experiments
Information Source
Online Literature abstracts

31
Formal Problem Definition

Given
A dictionary KT of keywords
A dictionary KM of user-provided keywords
(KT ⊇ KM)
A collection of literature abstracts, where each
abstract is represented as a set of keywords
Task
To find hyperedges exceeding a user-defined
threshold, each of which involves a set of
keywords from KM that are potentially connected by
another set of linking words from KT − KM

32
Modeling

Purpose
To use an approach similar to frequent itemset
mining
Define
total weight = support + cross support
Support: the set of keywords appears together in one
document
Cross support: the set of keywords can be partitioned
so that each part appears in a different
document
Issues
Since the downclosure property does not hold for
total weight, a modified downclosure property can be
defined
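
The two notions can be illustrated on abstracts represented as keyword sets. The sketch below is an assumption-laden simplification: support is counted as the number of abstracts containing all keywords, and cross support is reduced to a yes/no check over two-part partitions only. The function names and the three toy abstracts are hypothetical.

```python
from itertools import combinations

def support(keywords, documents):
    """Number of documents that contain every keyword in the set."""
    keywords = set(keywords)
    return sum(1 for doc in documents if keywords <= doc)

def has_cross_support(keywords, documents):
    """True if the keywords split into two non-empty parts that are
    found in two different documents (two-part partitions only)."""
    keywords = sorted(set(keywords))
    for size in range(1, len(keywords)):
        for part in combinations(keywords, size):
            first, second = set(part), set(keywords) - set(part)
            for a, doc_a in enumerate(documents):
                for b, doc_b in enumerate(documents):
                    if a != b and first <= doc_a and second <= doc_b:
                        return True
    return False

# Hypothetical abstracts, each reduced to a set of keywords.
docs = [{"geneA", "geneB"}, {"geneB", "geneC"}, {"geneA", "kinase"}]
```

Here geneA and geneC never share an abstract, so their support is 0, yet they are linked across two abstracts, which is the kind of transitive evidence cross support is meant to capture.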

33
Idea

Support satisfies the downclosure property
Let X be a set and Ω be its power set. A function f:
Ω → R satisfies the downclosure property if for all
A, B ∈ Ω with A ⊇ B, f(B) ≥ f(A)
Cross support can be designed to be restricted
below a particular value, i.e., it is bounded
Form a function h as the sum of two functions,
h = f + g
f satisfies the downclosure property
g is bounded
h satisfies the modified downclosure property
For any θ ≥ 0, if h(K_n) ≥ θ then f(K_{n−1}) ≥
max{0, θ − sup(g)}
This property can be used to devise an efficient
algorithm
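
The modified downclosure property follows in a few steps from the two ingredients. The derivation below is a sketch under stated assumptions: θ denotes the threshold, K_{n-1} is a subset of K_n, f is non-negative (as support is), and sup(g) is finite.

```latex
\begin{align*}
h(K_n) \ge \theta
  &\;\Rightarrow\; f(K_n) + g(K_n) \ge \theta \\
  &\;\Rightarrow\; f(K_n) \ge \theta - g(K_n) \ge \theta - \sup(g) \\
  &\;\Rightarrow\; f(K_{n-1}) \ge f(K_n) \ge \max\{0,\ \theta - \sup(g)\}
\end{align*}
```

The last line uses the downclosure of f for the first inequality and the non-negativity of f for the maximum with 0. A candidate set can therefore be discarded as soon as one of its subsets has f below this bound, in the same spirit as pruning in frequent itemset mining.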

34
Results
35
Similarity Measure among sets of genes

Each file containing gene names can be considered
as a Discrete Random Variable (DRV)
Each such DRV can take several values (gene
names)
For two such files X, Y and for any pair (x,y),
x ∈ X and y ∈ Y, p(x,y) can be computed from online
abstracts based on co-occurrence
Now defining Z = g(X,Y), Z is a random variable
The expectation of Z can be used as a similarity
measure
Different choices of g give rise to different similarity
measures
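
The expectation can be sketched as a weighted sum over gene pairs. The slides do not specify how co-occurrence counts become probabilities or which g is used, so the normalization below, the function names, the toy counts, and the indicator-style g are all assumptions for illustration.

```python
from itertools import product

def joint_probabilities(genes_x, genes_y, cooccurrence):
    """Normalize pairwise co-occurrence counts into p(x, y) over X x Y."""
    counts = {(x, y): cooccurrence.get(frozenset((x, y)), 0)
              for x, y in product(genes_x, genes_y)}
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no co-occurrence evidence for these gene sets")
    return {pair: c / total for pair, c in counts.items()}

def expected_similarity(p, g):
    """E[Z] for Z = g(X, Y), summed over all gene pairs."""
    return sum(prob * g(x, y) for (x, y), prob in p.items())

# Hypothetical gene files and abstract co-occurrence counts.
X, Y = ["A", "B"], ["C", "D"]
cooccurrence = {frozenset(("A", "C")): 3, frozenset(("B", "D")): 1}
p = joint_probabilities(X, Y, cooccurrence)

# Hypothetical g: 1 for one pair of interest, 0 otherwise.
similarity = expected_similarity(p, lambda x, y: 1.0 if (x, y) == ("A", "C") else 0.0)
```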

36
Roadmap

Motivation
Our Approach
Current Work
Learning Layouts of Flat-file Biological Datasets
Exploratory Tools for Biological Data Analysis
Proposed Work
Deep Web Mining for Biological Data
Semi-supervised Ranking
Multiple-instance Learning
Conclusion

37
Query Planning for Deep Web Mining

A huge source of online biological information is
available in the form of the deep web
An online query form needs to be
filled out
Required information is obtained by filling out
many such forms from different websites
There might be some dependencies among these forms
Requires redundancy elimination

38
Roadmap

Motivation
Our Approach
Current Work
Learning Layouts of Flat-file Biological Datasets
Exploratory Tools for Biological Data Analysis
Proposed Work
Deep Web Mining for Biological Data
Semi-supervised Ranking
Multiple-instance Learning
Conclusion

39
Semi-supervised Ranking

Ranking
Given a training set of examples with labels/pairwise
relationships
Task is to rank an unseen test set, i.e. to get a
permutation so that relevant examples are ranked
higher than irrelevant ones
This corresponds to learning a ranking function
Semi-supervised Ranking
Incorporating unlabeled examples to learn the
ranking function
Out of sample extension

40
Potential Application

Following a microarray experiment it might be
possible to guess if gene A is more important
than gene B involved in the experiment
However, determining all possible order relationships is time
consuming and error prone
Thus, from a small set of order relationships and
using other genes from the experiment as
unlabeled data, a semi-supervised ranking function
can be learned

41
Roadmap

Motivation
Our Approach
Current Work
Learning Layouts of Flat-file Biological Datasets
Exploratory Tools for Biological Data Analysis
Proposed Work
Deep Web Mining for Biological Data
Semi-supervised Ranking
Multiple-instance Learning
Conclusion

42
Multiple Instance Learning

Instead of instance-label pair (x,y), bag-label
pair (B,y) is provided as training data
A bag contains multiple instances
A bag label is negative, if each instance in the
bag has negative label
A bag label is positive, if there exists at least
one instance with positive label
Given an unseen bag, the task is to predict its
label
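
The bag-labeling rule can be written down directly. The sketch below only illustrates the rule itself: in actual multiple-instance training data the instance labels are not observed, and the encoding of labels as +1 and -1 and the two sample bags are assumptions.

```python
def bag_label(instance_labels):
    """A bag is positive iff at least one instance is positive, else negative."""
    return 1 if any(label == 1 for label in instance_labels) else -1

# Hypothetical bags with (normally hidden) instance labels.
bags = {"bag1": [-1, -1, -1], "bag2": [-1, 1, -1]}
labels = {name: bag_label(instances) for name, instances in bags.items()}
```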

43
Potential Application

Following a microarray experiment it might be
possible to form bags of genes with appropriate
labels
From different biological labs doing similar
experiments, many such bags can be obtained to
use as training data
Before designing a new microarray experiment, a
gene set can be selected based on multiple
instance learning

44
Summary

Use of data mining / machine learning techniques
to extract information from biological data
Work done
Learning layouts of flat-file biological datasets
Hypergraph Mining
Similarity Measure among sets of genes
Proposed Work
Study and application of machine learning
techniques
