Bioinformatics, Data Integration and Machine Learning: A Thesis Proposal

Transcript and Presenter's Notes

1
Bioinformatics, Data Integration and Machine
Learning: A Thesis Proposal
  • Kaushik Sinha
  • Supervisors: Prof. Gagan Agrawal and Prof.
    Mikhail Belkin

2
Roadmap
  • Motivation
  • Our Approach
  • Current Work
    • Learning Layouts of Flat-file Biological Datasets
    • Exploratory Tools for Biological Data Analysis
  • Proposed Work
    • Deep Web Mining for Biological Data
    • Semi-supervised Ranking
    • Multiple-instance Learning
  • Conclusion

3
Motivation
  • Integration is hard
  • Data explosion
  • Data size and number of data sources
  • New analysis tools
  • Autonomous resources
  • Heterogeneous data representation and various
    interfaces
  • Frequent updates
  • New trend: web and grid services

4
Motivation (contd.)
  • In recent years, DNA microarrays and other gene and
    protein assays have become essential tools for
    biologists
  • The next step of biological enquiry is to find out:
    • What is known about these genes?
    • How are these genes related to each other or
      to other genes identified in similar studies?
  • However, the major difficulties are:
    • How do we extract key properties shared by a set of
      candidate genes?
    • How do we generate reasonable hypotheses to
      explain them?
    • How do we define and evaluate similarity between
      sets of genes?

5
Motivating Example
  • Suppose that after a microarray experiment a
    biologist suspects that a small set of genes is
    related to a disease
  • This can be confirmed by searching the existing
    literature
  • One would expect related genes to appear together
    in the literature
  • Due to the sheer volume of literature,
    searching is time-consuming and error-prone
  • Some complications can arise as well
  • Suppose genes A and C are related and
    both of them are weakly related to gene B
  • In the literature, one would expect that
    • A, C appear together, and/or
    • A, B appear together
    • B, C appear together
  • How do we efficiently conclude that A and C are
    actually related?

6
Our Approach
  • Using data mining / machine learning techniques
    to extract useful information from biological
    data
  • Different forms of data
    • Flat-file data
    • Microarray data
    • Online literature abstracts
  • Develop different forms of tools
    • Layout extractor
    • Hypergraph mining
    • Similarity measure among sets of genes

8
Learning Layout of a Flat-File
  • Intractable in general
  • Try to learn the layout, then have a domain expert
    verify it
  • Key issue: which delimiters are being used?

9
Finding Delimiters
  • Some knowledge from a domain expert is required
    (semi-automatic)
  • Naïve approaches
    • Frequency counting
      • Counts frequently occurring single tokens (words
        separated by spaces)
    • Sequence mining
      • Counts frequently occurring sequences of tokens
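
The two naïve approaches above can be sketched in a few lines. This is only an illustration: the sample records are made up, and "sequence mining" is simplified here to counting contiguous token sequences of a fixed length, which may be narrower than what the original work used.

```python
from collections import Counter


def token_frequencies(lines):
    """Frequency counting: count single space-separated tokens."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def sequence_frequencies(lines, length):
    """Simplified sequence mining: count contiguous token sequences of `length` tokens."""
    counts = Counter()
    for line in lines:
        tokens = line.split()
        for start in range(len(tokens) - length + 1):
            counts[tuple(tokens[start:start + length])] += 1
    return counts


# Hypothetical flat-file fragment, invented for illustration only.
sample = [
    "ID P12345",
    "DE Kinase family protein",
    "ID Q67890",
    "DE Kinase inhibitor",
]
tokens = token_frequencies(sample)
pairs = sequence_frequencies(sample, 2)
```

In this toy input, ordinary content words such as "Kinase" are as frequent as the field tags, which is the weakness of pure frequency counting that the positional information is meant to address.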

10
Assumptions
  • Biological datasets are written for humans to
    read
  • It is very unlikely that delimiters will be
    scattered all around, in different places in a
    line
  • Position of the possible delimiters might provide
    useful information
  • Combination of positional and frequency
    information might be a better choice

11
Positional Weight
  • Let P be the set of positions in a line where
    a token can appear
  • For each position i ∈ P, tot_seq_i^j is the
    total number of token sequences of length j starting
    at position i
  • For each position i ∈ P, tot_unique_seq_i^j
    is the total number of unique token sequences of
    length j starting at position i
  • For any tuple (i,j), p_ratio(i,j) is defined from
    these two counts (the defining formula was shown as a
    figure on the slide)
  • p_ratio(i,j) can be log-normalized to get the
    positional weight p_wt(i,j), with the property
    p_wt(i,j) ∈ (0,1)
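
As a rough sketch of the two counts defined above, the following computes the total and unique number of token sequences per starting position. Positions are taken to be 0-based token indices, which is an assumption. The transcript does not preserve the formula for p_ratio, so the sketch stops at the counts.

```python
from collections import defaultdict


def positional_counts(lines, length):
    """Map each start position i to (tot_seq, tot_unique_seq) for sequences of `length` tokens."""
    seqs_at = defaultdict(list)
    for line in lines:
        tokens = line.split()
        for i in range(len(tokens) - length + 1):
            seqs_at[i].append(tuple(tokens[i:i + length]))
    return {i: (len(seqs), len(set(seqs))) for i, seqs in seqs_at.items()}


# Hypothetical flat-file fragment, invented for illustration only.
sample = [
    "ID P12345",
    "DE Kinase family protein",
    "ID Q67890",
    "DE Kinase inhibitor",
]
counts = positional_counts(sample, 1)
```

Here position 0 holds four tokens but only two distinct ones (the field tags), while later positions are more varied, which is the kind of signal a positional weight can exploit.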

12
Delimiter score (d_score)
  • The frequency weight f_wt(s_i^j) of any token sequence
    s_i^j of length j starting at position i
    is obtained by log-normalizing its frequency f(s_i^j)
  • Like the positional weight, f_wt(s_i^j) ∈ (0,1)
  • The positional and frequency weights can now be
    combined to get d_score as follows:
  • d_score(s_i^j) = α · p_wt(i,j) + (1 - α) · f_wt(s_i^j)
  • where α ∈ (0,1)
  • Thus d_score has the following two properties:
  • d_score(s_i^j) ∈ (0,1)
  • d_score(s_i^j) > d_score(s_k^j) implies that s_i^j is more
    likely to be a delimiter than s_k^j
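
A minimal sketch of the weighted combination described above. The slides do not say how the log normalization is done, so `log_normalize` below uses log(1 + value) / log(1 + max_value) purely as an assumed stand-in; the function names and the numbers are hypothetical.

```python
import math


def log_normalize(value, max_value):
    """Assumed normalization (not from the slides): log(1 + value) / log(1 + max_value)."""
    if max_value <= 0:
        raise ValueError("max_value must be positive")
    return math.log1p(value) / math.log1p(max_value)


def d_score(p_wt, f_wt, alpha):
    """Weighted combination of the positional weight and the frequency weight."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie in (0, 1)")
    return alpha * p_wt + (1.0 - alpha) * f_wt


# Hypothetical numbers: a sequence seen 9 times, with the most frequent one seen 99 times.
f_wt = log_normalize(9, 99)
score = d_score(p_wt=0.8, f_wt=f_wt, alpha=0.5)
```

Because both weights lie in the unit interval and the two coefficients sum to 1, the score also stays in the unit interval, which matches the first property listed on the slide.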

13
Generating layout descriptor
  • Once the delimiters are identified, an NFA can be
    built by scanning the whole database, where the
    delimiters are the states of the NFA
  • This NFA can be used to generate a layout
    descriptor, since it naturally represents optional
    and repeating states
  • The following figure shows an NFA where A, B,
    C, D, and E are delimiters, with B being an
    optional delimiter and C, D being repeating
    delimiters
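
One way to picture this step is to record, for each data record, the order in which delimiters appear and collect the observed transitions. The sketch below builds such a transition set (a simplification of an NFA, not necessarily the construction used in the original work) and flags optional and repeating delimiters. The two records are invented to mirror the A to E example on the slide.

```python
def build_transitions(records):
    """Collect delimiter-to-delimiter transitions seen across all records."""
    transitions = set()
    for rec in records:
        path = ["START"] + list(rec) + ["END"]
        transitions.update(zip(path, path[1:]))
    return transitions


def optional_delimiters(records):
    """Delimiters that are missing from at least one record."""
    all_delims = set().union(*records)
    return {d for d in all_delims if any(d not in rec for rec in records)}


def repeating_delimiters(records):
    """Delimiters that occur more than once within some record."""
    return {d for rec in records for d in rec if rec.count(d) > 1}


# Hypothetical delimiter sequences for two records.
records = [
    ["A", "B", "C", "D", "E"],
    ["A", "C", "D", "C", "D", "E"],
]
transitions = build_transitions(records)
optional = optional_delimiters(records)
repeating = repeating_delimiters(records)
```

The edge from A directly to C is what makes B optional, and the backward edge from D to C is what makes the C, D pair repeat.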

14
Results
  • By suitably varying α (the weight in d_score), a tight
    superset of possible delimiters is found
  • A domain expert can then help to identify the
    true delimiters
  • Results from 3 different flat-file datasets are
    as follows

15
Comparison with naïve approaches
  • The d_score-based approach does a better
    job than the naïve approaches
  • The following table shows the improvement

16
Realistic Situation
  • The task of identifying a complete list of correct
    delimiters is difficult
  • Most likely we will end up with an
    incomplete list of delimiters
  • Delimiters that do not appear in every
    data record (optional delimiters) are the ones most
    likely to be missed

17
Identifying Optional Delimiters
  • Given an incomplete list of delimiters, how can we
    identify optional delimiters, if any?
  • Build an NFA based on the given incomplete information
  • Perform clustering to identify possible crucial
    delimiters
  • Perform contrast analysis

18
Crucial delimiter
  • A delimiter is considered crucial if missing
    delimiters can appear immediately after it
  • The goal is to create two clusters:
    • One containing the delimiters that are not crucial
    • The other containing the crucial delimiters

19
Identifying crucial delimiters: A few definitions
  • Succ(X): the set of delimiters that can immediately
    follow X
  • Dist_App: the number of groups of occurrences of X, based
    on the number of text lines between X and the immediately
    next delimiter
  • Info_Tuple (nXi, fXi, tXi): information for each
    Dist_App
  • Info_Tuple_List LX: for any X, the list of all
    possible Info_Tuples

20
Metric for clustering
  • rXf is likely to be low if an optional delimiter
    appears immediately after X, and high otherwise
  • Choose a suitable cut-off value rc and assign
    delimiters to different groups as follows:
  • If rXf < rc, assign X to the group containing
    possible crucial delimiters
  • Otherwise, assign X to the group containing non-crucial
    delimiters
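
The grouping rule above amounts to a simple threshold split. In the sketch below the metric values and the cut-off are invented, since the transcript does not preserve how rXf is computed; only the comparison against rc comes from the slide.

```python
def split_by_cutoff(r_values, r_c):
    """Split delimiters into (possibly crucial, non-crucial) using the cut-off r_c."""
    crucial = {x for x, r in r_values.items() if r < r_c}
    non_crucial = set(r_values) - crucial
    return crucial, non_crucial


# Hypothetical metric values for four delimiters and a hypothetical cut-off.
r_values = {"ID": 0.92, "DE": 0.35, "OS": 0.88, "RN": 0.41}
crucial, non_crucial = split_by_cutoff(r_values, r_c=0.5)
```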

21
Observations and Facts
  • Missing optional delimiters can appear
    ONLY immediately after crucial delimiters
  • Non-crucial delimiters can therefore be pruned away
  • Consider two Info_Tuples (nX1, fX1, tX1) and
    (nX2, fX2, tX2) in LX
  • If a missing delimiter appears immediately after
    the appearance corresponding to the first tuple
    but not the second one, then
    • nX1 > nX2
    • The missing delimiter will appear in tX1 but not in
      tX2

22
A hypothetical example illustrating Contrast
Analysis
  • Suppose X is a crucial delimiter with 2
    Info_Tuples, L1 and L2, as follows:
    • L1 = (50, 20, l1.txt)
    • L2 = (20, 12, l2.txt)
  • Sequence mining on l1.txt and l2.txt yields two
    sets of frequently occurring sequences, S1 and
    S2, as follows:
    • S1 = {f1, f5, f6, f8, f13, f21}
    • S2 = {f1, f4, f6, f7, f8, f10, f13, f21}
  • Since f5 ∈ S1 but f5 ∉ S2, f5 is a possible
    missing delimiter
  • f5 is a missing delimiter only if it has a high
    d_score or is verified by a domain expert as a
    valid delimiter

23
Contrast Analysis
  • For any i, j, if nXi > nXj, look for frequently
    occurring sequences in tXi and tXj; call them
    fsXi and fsXj respectively
  • If there exists a frequent sequence fs such
    that fs ∈ fsXi but fs ∉ fsXj, then fs is quite
    likely to be a possible delimiter
  • If fs has a fairly high d_score or is identified by
    a domain expert as a valid delimiter, add it to the
    incomplete list as a newly found delimiter
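
Contrast analysis as described here reduces to a set difference followed by a confirmation step. The sketch below reuses the sequence sets from the hypothetical example on the previous slide; the d_score value and the threshold are invented, and `confirm` is a hypothetical helper name.

```python
def contrast(fs_i, fs_j):
    """Frequent sequences found in the first text but not in the second."""
    return set(fs_i) - set(fs_j)


def confirm(candidates, d_scores, threshold, expert_approved=frozenset()):
    """Keep candidates with a high enough d_score or with a domain expert's approval."""
    return {
        c for c in candidates
        if d_scores.get(c, 0.0) >= threshold or c in expert_approved
    }


# Frequent sequences from the hypothetical example (text with more lines first).
s1 = {"f1", "f5", "f6", "f8", "f13", "f21"}
s2 = {"f1", "f4", "f6", "f7", "f8", "f10", "f13", "f21"}
candidates = contrast(s1, s2)

# Invented d_score and threshold, used only to exercise the confirmation step.
new_delimiters = confirm(candidates, d_scores={"f5": 0.82}, threshold=0.7)
```

The difference is deliberately one-sided: sequences that occur only in the shorter text (f4, f7, f10 here) are not candidates.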

24
Generalized Contrast Analysis
  • In the case of more than two Info_Tuples, compute the
    mean of all nXi values
  • Form one group by appending the text from all
    Info_Tuples whose nXi is above the mean
  • Form another group by appending the text from all
    Info_Tuples whose nXi is below the mean
  • Perform contrast analysis between the resulting
    groups

25
Another example illustrating Generalized Contrast
Analysis
  • Suppose X is a crucial delimiter with 3
    Info_Tuples, L1, L2, L3, as follows:
    • L1 = (50, 20, l1.txt)
    • L2 = (20, 12, l2.txt)
    • L3 = (15, 10, l3.txt)
  • Mean number of lines = (50 + 20 + 15) / 3 ≈ 28.3
  • Append l2.txt and l3.txt; call the result t2.txt
  • Sequence mining on l1.txt and t2.txt yields two
    sets of frequently occurring sequences, S1 and
    S2, as follows:
    • S1 = {f1, f5, f6, f8, f13, f21}
    • S2 = {f1, f4, f6, f7, f8, f10, f13, f21}
  • Since f5 ∈ S1 but f5 ∉ S2, f5 is a possible
    missing delimiter
  • f5 is a missing delimiter only if it has a high
    d_score or is verified by a domain expert as a
    valid delimiter
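
The grouping step of the generalized analysis can be sketched as a split around the mean line count, using the three Info_Tuples from the example. The exact grouping conditions are not preserved in the transcript, so sending ties to the lower group is an assumption made here.

```python
from statistics import mean


def split_by_mean(info_tuples):
    """Split (n, f, text_name) tuples into text names above and not above the mean n."""
    m = mean(n for n, _, _ in info_tuples)
    above = [text for n, _, text in info_tuples if n > m]
    below = [text for n, _, text in info_tuples if n <= m]  # ties go here (assumption)
    return m, above, below


info_tuples = [
    (50, 20, "l1.txt"),
    (20, 12, "l2.txt"),
    (15, 10, "l3.txt"),
]
m, above, below = split_by_mean(info_tuples)
```

The texts in each group would then be appended and mined for frequent sequences, and the two resulting sets contrasted as in the two-tuple case.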

26
Overall Algorithms

27
Results: Optional delimiters
  • Pruning

28
Results: Non-optional missing delimiters
  • Even though it was designed for finding optional
    delimiters, our algorithm also works, in some cases,
    for missing non-optional delimiters
  • If a missing non-optional delimiter appears
    in exactly the same location in each record, then
    our algorithm fails
  • If a non-optional delimiter has a backward edge
    coming from a delimiter that appears later in a
    topologically sorted NFA, then our algorithm works

30
Hypergraph Mining
  • Basic motivation
    • To find useful transitive relations
      (hypergraphs) among genes
  • Example (gene-disease relationship)
    • Gene A is related to gene B
    • Gene B is related to gene C
    • Is gene A related to gene C?
  • Gene source
    • Microarray experiments
  • Information source
    • Online literature abstracts

31
Formal Problem Definition
  • Given
    • A dictionary KT of keywords
    • A dictionary KM of user-provided keywords
      (KT ⊇ KM)
    • A collection of literature abstracts, where each
      abstract is represented as a set of keywords
  • Task
    • To find hyperedges exceeding a user-defined
      threshold, each of which involves a set of
      keywords from KM that are potentially connected by
      another set of linking words from KT - KM

32
Modeling
  • Purpose
    • To use an approach similar to frequent itemset
      mining
  • Define
    • total weight = support + cross support
    • Support: the set of keywords appears together in one
      document
    • Cross support: the set of keywords can be partitioned
      so that each partition appears in a different
      document
  • Issues
    • Since the downclosure property does not hold for
      total weight, a modified downclosure property can be
      defined
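
To make the two notions concrete, here is a small sketch over abstracts represented as keyword sets. The slides do not give the exact weighting, so support is taken as a plain document count and cross support is reduced to a yes/no check over two-part partitions; both simplifications, the function names, and the toy abstracts are assumptions.

```python
from itertools import combinations


def support(keywords, abstracts):
    """Number of abstracts that contain every keyword in the set."""
    ks = set(keywords)
    return sum(1 for doc in abstracts if ks <= doc)


def has_cross_support(keywords, abstracts):
    """True if the set splits into two non-empty parts found in two different abstracts."""
    ks = sorted(set(keywords))
    for size in range(1, len(ks)):
        for part in combinations(ks, size):
            a, b = set(part), set(ks) - set(part)
            for i, d1 in enumerate(abstracts):
                for j, d2 in enumerate(abstracts):
                    if i != j and a <= d1 and b <= d2:
                        return True
    return False


# Toy abstracts, invented for illustration only.
abstracts = [
    {"geneA", "geneB"},
    {"geneB", "geneC"},
    {"geneA", "kinase"},
]
direct = support({"geneA", "geneC"}, abstracts)
indirect = has_cross_support({"geneA", "geneC"}, abstracts)
```

In the toy data geneA and geneC never co-occur, so their support is zero, yet they still show cross support, which is the situation the transitive-relation example is about.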

33
Idea
  • Support satisfies the downclosure property
    • Let X be a set and Ω its power set. A function
      f: Ω → R satisfies the downclosure property if, for all
      A, B ∈ Ω with A ⊆ B, f(A) ≥ f(B)
  • Cross support can be designed to be restricted
    below a particular value, i.e., it is bounded
  • Form a function h as the sum of two functions,
    h = f + g, where
    • f satisfies the downclosure property
    • g is bounded
  • h satisfies the modified downclosure property
    • For any θ ≥ 0, if h(K_n) ≥ θ then
      f(K_{n-1}) ≥ max{0, θ - sup(g)}
  • This property can be used to devise an efficient
    algorithm
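
The modified downclosure property can be derived in two steps. The derivation below assumes that K_{n-1} is a subset of K_n, reads the downclosure property of f as f(A) ≥ f(B) whenever A ⊆ B, and takes f to be non-negative (as a support count would be); θ is the user-defined threshold.

```latex
\begin{align*}
h(K_n) \ge \theta
  &\;\Longrightarrow\; f(K_n) = h(K_n) - g(K_n) \ge \theta - \sup(g)
     && \text{since } g(K_n) \le \sup(g) \\
  &\;\Longrightarrow\; f(K_{n-1}) \ge f(K_n) \ge \theta - \sup(g)
     && \text{since } K_{n-1} \subseteq K_n \\
  &\;\Longrightarrow\; f(K_{n-1}) \ge \max\{0,\ \theta - \sup(g)\}
     && \text{since } f \ge 0
\end{align*}
```

Read contrapositively, any keyword set whose support falls below max{0, θ - sup(g)} can be pruned, because none of its supersets can reach total weight θ.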

34
Results

35
Similarity Measure among sets of genes
  • Each file containing gene names can be considered
    as a discrete random variable (DRV)
  • Each such DRV can take several values (gene
    names)
  • For two such files X, Y and for any pair (x,y) with
    x ∈ X and y ∈ Y, p(x,y) can be computed from online
    abstracts based on co-occurrence
  • Defining Z = g(X,Y), Z is also a random variable
  • The expectation of Z can be used as a similarity
    measure
  • Different choices of g give rise to different similarity
    measures
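
As a sketch of this idea, the expectation of Z = g(X,Y) is a probability-weighted sum over gene pairs. The co-occurrence counts, the way they are normalized into p(x,y), and the choice of g below are all hypothetical; the slides do not specify them.

```python
def expected_similarity(joint_p, g):
    """E[Z] for Z = g(X, Y), where joint_p maps (x, y) to p(x, y)."""
    return sum(p * g(x, y) for (x, y), p in joint_p.items())


# Invented co-occurrence counts between genes of two files, normalized to probabilities.
cooccurrence = {
    ("TP53", "TP53"): 4,
    ("TP53", "EGFR"): 2,
    ("BRCA1", "TP53"): 2,
    ("BRCA1", "EGFR"): 0,
}
total = sum(cooccurrence.values())
joint_p = {pair: count / total for pair, count in cooccurrence.items()}


def same_gene(x, y):
    """One hypothetical choice of g: 1 if the two gene names match, else 0."""
    return 1.0 if x == y else 0.0


similarity = expected_similarity(joint_p, same_gene)
```

Swapping in a different g changes the measure without touching the rest of the computation, which is the flexibility the last bullet points to.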

37
Query Planning for Deep Web Mining
  • A huge source of online biological information is
    available in the form of the deep web
  • An online query form needs to be
    filled out
  • The required information is obtained by filling out
    many such forms from different websites
  • There might be some dependencies among these forms
  • Requires redundancy elimination

39
Semi-supervised Ranking
  • Ranking
    • Given a training set of examples with labels or pairwise
      relationships
    • The task is to rank an unseen test set, i.e. to get a
      permutation such that relevant examples are ranked
      higher than irrelevant ones
    • This corresponds to learning a ranking function
  • Semi-supervised ranking
    • Incorporating unlabeled examples to learn the
      ranking function
    • Out-of-sample extension

40
Potential Application
  • Following a microarray experiment, it might be
    possible to guess whether gene A is more important
    than gene B among the genes involved in the experiment
  • However, determining all possible order relationships is
    time-consuming and error-prone
  • Thus, from a small set of order relationships, and
    using the other genes from the experiment as
    unlabeled data, a semi-supervised ranking function
    can be learned

42
Multiple Instance Learning
  • Instead of an instance-label pair (x,y), a bag-label
    pair (B,y) is provided as training data
  • A bag contains multiple instances
  • A bag label is negative if every instance in the
    bag has a negative label
  • A bag label is positive if there exists at least
    one instance with a positive label
  • Given an unseen bag, the task is to predict its
    label
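
The labeling rule above can be written down directly. The sketch only illustrates how a bag label follows from its instance labels; in actual multiple-instance learning the instance labels are not observed and a learner has to predict the bag label from the instances themselves. Encoding labels as +1 and -1 is an assumption.

```python
def bag_label(instance_labels):
    """Return +1 if any instance is positive, otherwise -1 (labels encoded as +1 / -1)."""
    return 1 if any(label == 1 for label in instance_labels) else -1


# Hypothetical bags of instance labels.
positive_bag = bag_label([-1, -1, 1])
negative_bag = bag_label([-1, -1, -1])
```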

43
Potential Application
  • Following a microarray experiment, it might be
    possible to form bags of genes with appropriate
    labels
  • From different biological labs doing similar
    experiments, many such bags can be obtained to
    use as training data
  • Before designing a new microarray experiment, a
    gene set can be selected based on multiple-instance
    learning

44
Summary
  • Use of data mining / machine learning techniques
    to extract information from biological data
  • Work done
    • Learning layouts of flat-file biological datasets
    • Hypergraph mining
    • Similarity measure among sets of genes
  • Proposed work
    • Study and application of machine learning
      techniques