1
Output Sensitive Algorithm for Finding Similar Objects
  • Takeaki Uno
  • National Institute of Informatics,
    The Graduate University for Advanced Studies (Sokendai)

Jul/2/2007 Combinatorial Algorithms Day
2
Motivation: Analyzing Huge Data
  • Recent information technology has given us many huge
    databases
  • - Web, genome, POS, log, ...
  • "Construction" and "keyword search" can be done
    efficiently
  • The next step is analysis: capturing features of
    the data
  • - statistics, such as size, rows, density,
    attributes, distribution
  • Can we get more?
  • ⇒ look at (simple) local structures,
    but keep them simple and basic

[Figure: examples of databases: results of experiments, genome (ATGCGCCGTA TAGCGGGTGG TTCGCGTTAG ...)]
3
Our Focus
  • Find all pairs of similar objects (or
    structures)
  • (or pairs in a binary relation, instead of similarity)
  • This is perhaps very basic and fundamental
  • ⇒ There should be many applications
  • - finding globally similar structures,
  • - constructing neighbor graphs,
  • - detecting locally dense structures (groups of
    related objects)

In this talk, we look at strings
4
Existing Studies
  • There are many studies on similarity search
    (homology search)
  • ⇒ Given a database, construct a data structure
    that lets us quickly find the objects similar to
    a given query object
  • - strings with Hamming distance, edit distance
  • - points in the plane (k-d trees), Euclidean space
  • - sets
  • - constructing neighbor graphs (for smaller
    dimensions)
  • - genome sequence comparison (heuristics)
  • Both exact and approximate approaches
  • All-pairs comparison is not popular

5
Our Problem
  • Problem:
  • Given a database composed of n strings of
    the same fixed length l, and a threshold d,
  • find all pairs of strings such that the
    Hamming distance between the two strings is at most d

Input: ATGCCGCG GCGTGTAC GCCTCTAT TGCGTTTC TGTAATGA ...
Output: (ATGCCGCG, AAGCCGCC), (GCCTCTAT, GCTTCTAA), (TGTAATGA, GGTAATGG), ...
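
The problem statement can be made concrete with a brute-force baseline. The following is only a minimal sketch, not the algorithm proposed in this talk: the function names `hamming` and `similar_pairs_naive` are hypothetical, and the six input strings are taken from the example pairs on this slide. It compares every pair directly, so it takes time proportional to l times the number of pairs.

```python
from itertools import combinations

def hamming(s, t):
    """Number of positions at which the equal-length strings s and t differ."""
    return sum(a != b for a, b in zip(s, t))

def similar_pairs_naive(strings, d):
    """Compare every pair directly and keep those within distance d."""
    return [(s, t) for s, t in combinations(strings, 2) if hamming(s, t) <= d]

strings = ["ATGCCGCG", "AAGCCGCC", "GCCTCTAT",
           "GCTTCTAA", "TGTAATGA", "GGTAATGG"]
for s, t in similar_pairs_naive(strings, 2):
    print(s, t, hamming(s, t))
```

With d = 2 this reports exactly the three pairs listed on the slide, each at distance 2.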
6
Trivial Bound of the Complexity
  • If all the strings are exactly the same, we
    have to output all the pairs, which takes $\Theta(n^2)$
    time
  • ⇒ the simple all-pairs comparison in $O(l n^2)$ time
    is optimal
    if l is a fixed constant
  • ⇒ Is there no room for improvement?
  • In practice, we analyze only when the output
    is small; otherwise the analysis is meaningless
  • ⇒ consider the complexity in terms of
    the output size

M: the number of outputs
We propose an $O(2^l (n + lM))$ time algorithm
7
Basic Idea: Fixed-Position Subproblem
  • Consider the following subproblem:
  • For a given set of l-d letter positions, find all
    pairs of strings with Hamming distance at most d
    such that
  • "the letters on the l-d positions are the same"
  • Ex) the 2nd, 4th, and 5th positions of strings of
    length 5
  • We can solve it by a "radix sort" on the letters at those
    positions, in $O(l n)$ time.
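
The subproblem can be sketched as follows. Note the assumptions: this sketch groups the strings with a hash table instead of the radix sort named on the slide (same grouping, different mechanism), the function name `fixed_position_pairs` is hypothetical, and the four sample strings are invented for illustration. Because the strings in one group agree on l-d positions, at most d letters can differ, so no distance check is needed.

```python
from collections import defaultdict
from itertools import combinations

def fixed_position_pairs(strings, positions):
    """Index pairs (i, j) whose strings carry the same letters on `positions`.

    `positions` are 0-based indices; hashing replaces the radix sort.
    """
    groups = defaultdict(list)
    for idx, s in enumerate(strings):
        groups[tuple(s[p] for p in positions)].append(idx)
    pairs = []
    for members in groups.values():
        pairs.extend(combinations(members, 2))
    return pairs

# Strings of length 5 with d = 2; the 2nd, 4th and 5th positions are 1, 3, 4.
strings = ["ABCDE", "XBYDE", "ABCDF", "QBCDE"]
print(fixed_position_pairs(strings, (1, 3, 4)))
```

Here "ABCDF" differs from the others at the 5th position, so it forms its own group and appears in no pair.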

8
Examine All Cases
  • Solve the subproblem for all combinations of
    the positions
  • ⇒ If the distance between two strings S1 and S2 is at most
    d,
  • the letters on some l-d positions (say P) are the
    same
  • ⇒ In at least one combination, the pair S1 and S2 is found
  • (in the subproblem of combination P)
  • The number of combinations is $\binom{l}{d}$. When l = 5 and d = 2, it is
    10
  • ⇒ Computation is "radix sorts + α", with $O(\binom{l}{d} l n)$
    time for sorting
  • ⇒ Apply branch-and-bound to the radix sorts, giving $O(\binom{l}{d} n)$ time

9
Exercise
  • Find all pairs of strings with Hamming distance
    at most 1

Input:                     ABC ABD ACC EFG FFG AFG GAB
Sorted on positions 2, 3:  GAB ABC ABD ACC EFG FFG AFG
Sorted on positions 1, 3:  ABC ACC ABD AFG EFG FFG GAB
Sorted on positions 1, 2:  ABC ABD ACC AFG EFG FFG GAB
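
The exercise can be checked with a short sketch that solves the fixed-position subproblem for every combination of l-d positions. This is an illustration under assumptions, not the speaker's implementation: hashing stands in for the radix sort, the name `similar_pairs_all_positions` is hypothetical, a set is used to collapse pairs found more than once, and the input strings are assumed to be distinct.

```python
from collections import defaultdict
from itertools import combinations

def similar_pairs_all_positions(strings, d):
    """Solve the subproblem for every choice of l - d positions."""
    l = len(strings[0])
    found = set()
    for positions in combinations(range(l), l - d):
        groups = defaultdict(list)
        for s in strings:
            groups[tuple(s[p] for p in positions)].append(s)
        for members in groups.values():
            # Strings agreeing on l - d positions differ in at most d letters.
            found.update(combinations(members, 2))
    return found

strings = ["ABC", "ABD", "ACC", "EFG", "FFG", "AFG", "GAB"]
for s, t in sorted(similar_pairs_all_positions(strings, 1)):
    print(s, t)
```

For these seven strings and distance at most 1, the sketch finds five pairs: ABC with ABD, ABC with ACC, and the three pairs among EFG, FFG and AFG.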
10
Duplication: How Long Is "α"?
  • If two strings S1 and S2 are exactly the same,
    the pair is found in all subproblems, that is,
    $\binom{l}{d}$ times
  • ⇒ If we allow duplications, "α" needs $O(M \binom{l}{d})$ time
  • ⇒ To avoid duplications, use "canonical
    positions"

11
Avoid Duplications by Canonical Positions
  • For two strings S1 and S2, their canonical
    positions are the first l-d positions at which they have the same
    letters
  • We output the pair S1 and S2 only in the
    subproblem of their canonical positions
  • Computing the canonical positions takes $O(d)$
    time, so "α" needs $O(M d \binom{l}{d})$ time

This avoids duplications without keeping the solutions
in memory
$O(\binom{l}{d} (n + dM)) = O(2^l (n + lM))$ time in total
($O(n + M)$ if l is a fixed constant)
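
The canonical-position rule can be sketched as below. The names `canonical_positions` and `similar_pairs_once` are hypothetical, hashing again replaces the radix sort, and the canonical positions are computed by a plain scan over all l positions, which is simpler than the O(d) computation mentioned on the slide. Each pair is yielded in exactly one subproblem, so no set of already reported pairs is kept.

```python
from collections import defaultdict
from itertools import combinations

def canonical_positions(s, t, d):
    """First l - d positions where s and t agree, or None if there are fewer."""
    need = len(s) - d
    same = [p for p in range(len(s)) if s[p] == t[p]]
    return tuple(same[:need]) if len(same) >= need else None

def similar_pairs_once(strings, d):
    """Yield each index pair within distance d exactly once."""
    l = len(strings[0])
    for positions in combinations(range(l), l - d):
        groups = defaultdict(list)
        for idx, s in enumerate(strings):
            groups[tuple(s[p] for p in positions)].append(idx)
        for members in groups.values():
            for i, j in combinations(members, 2):
                # Report the pair only in the subproblem of its canonical positions.
                if canonical_positions(strings[i], strings[j], d) == positions:
                    yield i, j

print(list(similar_pairs_once(["AAAA", "AAAA", "AAAB"], 1)))
```

In this small example the two identical strings fall into the same group in all four subproblems, yet the pair is reported only once.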
12
In Practice
  • Is $\binom{l}{d}$ small in practice?
  • ⇒ In some cases, yes (e.g., genome sequences)
  • If we want to find strings with at most 10%
    error:
  • $\binom{20}{2} = 190$, $\binom{30}{3} = 4060$, $\binom{60}{6} = 50063860$
  • This may be too large for (a bit) larger l
  • To deal with (a bit) larger l, we use a
    variant of this algorithm
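
The number of subproblems quoted above can be reproduced with the standard library; this is only a small arithmetic check using `math.comb` (available in Python 3.8 and later).

```python
from math import comb

# Number of ways to choose the d "free" positions out of l.
for l, d in [(20, 2), (30, 3), (60, 6)]:
    print(f"C({l}, {d}) = {comb(l, d)}")
```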

13
Partition to Blocks
  • Consider a partition of each string into k blocks
  • For given k-d block positions, find all
    pairs of strings with distance at most d s.t.
    "the blocks on the positions are the same"
  • The radix sorts are done in $O(\binom{k}{d} n)$ time
  • Ex) the 2nd, 4th, and 5th block positions of strings
    of length 5

14
Small "α" Is Expected
  • The Hamming distance of two strings may be
    larger than d, even if their k-d blocks are the
    same
  • ⇒ In the worst case, "α" is not linear in the
    output size
  • However, if the k-d blocks contain
    enough letters, few strings have the same
    blocks
  • ⇒ "α" is not large in practice; the running time is almost
    $O(\binom{k}{d} n)$
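
The block variant can be sketched as follows, under stated assumptions: the names `split_blocks` and `similar_pairs_blocks` are hypothetical, blocks are cut at near-equal lengths (the slides do not say how the blocks are chosen), hashing replaces the radix sort, and a set is used to collapse repeated candidates. Two strings within distance d differ in at most d blocks, so they agree on some k-d blocks; since agreeing blocks do not guarantee a small distance, each candidate pair is verified explicitly.

```python
from collections import defaultdict
from itertools import combinations

def hamming(s, t):
    return sum(a != b for a, b in zip(s, t))

def split_blocks(s, k):
    """Split s into k contiguous blocks of near-equal length."""
    bounds = [i * len(s) // k for i in range(k + 1)]
    return [s[bounds[i]:bounds[i + 1]] for i in range(k)]

def similar_pairs_blocks(strings, d, k):
    """Index pairs within distance d, found via shared k - d blocks."""
    blocks = [split_blocks(s, k) for s in strings]
    found = set()
    for chosen in combinations(range(k), k - d):
        groups = defaultdict(list)
        for idx, b in enumerate(blocks):
            groups[tuple(b[c] for c in chosen)].append(idx)
        for members in groups.values():
            for i, j in combinations(members, 2):
                # Same blocks do not imply distance <= d, so verify.
                if hamming(strings[i], strings[j]) <= d:
                    found.add((i, j))
    return found

strings = ["ATGCCGCG", "AAGCCGCC", "GCCTCTAT",
           "GCTTCTAA", "TGTAATGA", "GGTAATGG"]
print(sorted(similar_pairs_blocks(strings, d=2, k=4)))
```

With k = 4 only 6 block combinations are examined for these strings of length 8, instead of the 28 position combinations of the letter-wise version.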

15
Experiments: l = 20 and d = 0, 1, 2, 3
Prefixes of the human Y chromosome. Notebook PC with
Pentium M 1.1 GHz, 256 MB RAM
16
Comparison of Long Strings
  • Slice one of the long strings with overlaps
  • Partition the other long string without overlaps
  • Compare all pairs
  • 1. Draw a matrix: the intensity of a cell is
    given by the pairs inside it
  • 2. Draw a point if there are 3 pairs in an area
    of length α and width β
  • ⇒ if two substrings of length α have an error of a bit
    less than k%, they have at least some
    short similar substrings
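
The two ways of cutting a long string can be sketched as below. The helper names `overlapping_slices` and `disjoint_blocks` are hypothetical, the two sample strings are invented, and taking an overlapping slice at every position is an assumption about how "with overlaps" is meant. The resulting short strings would then be passed to the similar-pair search.

```python
def overlapping_slices(text, length):
    """Substrings of the given length starting at every position."""
    return [text[i:i + length] for i in range(len(text) - length + 1)]

def disjoint_blocks(text, length):
    """Consecutive non-overlapping substrings; a shorter tail is dropped."""
    return [text[i:i + length] for i in range(0, len(text) - length + 1, length)]

a = "ATGCGCCGTATAG"
b = "GCGCCGTTAGCG"
print(overlapping_slices(a, 4)[:3])
print(disjoint_blocks(b, 4))
```

Slicing only one of the two strings with overlaps keeps the number of short strings on the other side small, while every alignment offset is still covered.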

17
Comparison of Chromosomes
  • Human chromosome 21 and chimpanzee chromosome 22
  • Take strings of 30 letters from both, with
    overlaps
  • Intensity is given by the number of pairs
  • White ⇒ possibly similar
  • Black ⇒ never similar
  • Grid lines indicate "repetitions
    of similar structures"

20 min. on a PC
18
Homology Search on Chromosomes
  • Human X and mouse X chromosomes (150M strings for
    each)
  • Take strings of 30 letters beginning at
    every position
  • For human X, without overlaps
  • d = 2, k = 7
  • Draw a dot if 3 points are
    in an area of width 300
    and length 3000

[Figure: dot plot, human X chr. versus mouse X chr.]
1 hour on a PC
19
Extensions ???
  • Can we solve the problem for other objects?
  • (sets, sequences, graphs, ...)
  • For graphs, maybe yes, but the practical
    performance is uncertain
  • For sets, Hamming distance is not preferable:
    for large sets, many differences should be
    allowed
  • For continuous objects, such as points in
    Euclidean space, we can hardly bound the
    complexity in the same way
  • (In the discrete version, the neighbors are
    finite; actually, they are
    classified into a constant number of groups)

20
Conclusion
  • Output-sensitive algorithm for finding pairs of
    similar strings
    (in terms of Hamming distance)
  • Multiple classification by the positions required to be the
    same
  • Using blocks to reduce the practical
    computation
  • Application to genome sequence comparison

Future works
  • Extension to other objects (sets, sequences,
    graphs)
  • Extension to continuous objects (points
    in Euclidean space)
  • Efficient spin out heuristics for practice
  • Genome analysis system