Counting Suffix Arrays and Strings

SPIRE 2005 - Jens Stoye

1
Counting Suffix Arrays and Strings
2
Suffix Array Data Structure
Suffix array: lexicographically sorted list of all suffixes
Example for the text TCTTCTCTTCTC (start position - suffix):
13 - (empty suffix)
12 - C
10 - CTC
5 - CTCTTCTC
7 - CTTCTC
2 - CTTCTCTTCTC
11 - TC
9 - TCTC
4 - TCTCTTCTC
6 - TCTTCTC
1 - TCTTCTCTTCTC
8 - TTCTC
3 - TTCTCTTCTC
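The listing above can be reproduced with a minimal Python sketch that sorts the suffixes directly. The function name `suffix_array` is an illustrative choice, and treating position n+1 as the empty suffix is an assumption made to match the entry for position 13; direct sorting is meant as a readable reference, not an efficient construction.

```python
def suffix_array(text):
    """Return the 1-based start positions of all suffixes in sorted order.

    Position len(text) + 1 stands for the empty suffix (an assumption made
    here to mirror the slide, where position 13 comes first).
    """
    n = len(text)
    # Python compares strings lexicographically, and a proper prefix
    # sorts before any longer string that extends it.
    return sorted(range(1, n + 2), key=lambda i: text[i - 1:])


print(suffix_array("TCTTCTCTTCTC"))
# [13, 12, 10, 5, 7, 2, 11, 9, 4, 6, 1, 8, 3]
```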
3
Overview
  • Classify strings sharing same suffix array
  • Counting strings sharing same suffix array
  • Counting suffix arrays → lower bound for suffix array
    compression
  • Summation identities

4
1. Classify Strings for Suffix Array
  • t - string of length n,
  • P - permutation of 1,..., n,
  • R - inverse of P.
  • Theorem
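  • P is the suffix array of t if and only if
  • for all $i \in \{1,\dots,n-1\}$
  • (a) $t_{P[i]} \le t_{P[i+1]}$ and
  • (b) $t_{P[i]} = t_{P[i+1]} \Rightarrow R[P[i]+1] < R[P[i+1]+1]$
  • Given (a), condition (b) is the same as
  • $R[P[i]+1] > R[P[i+1]+1] \Rightarrow t_{P[i]} < t_{P[i+1]}$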
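
The characterization can be checked mechanically. The following is a minimal sketch, not the authors' code: it assumes the convention R[n+1] = 0 (the empty suffix ranks before all others), which the transcript does not state, and all function names are illustrative. Here P covers positions 1..n only, without the empty suffix.

```python
from itertools import permutations, product


def naive_suffix_array(t):
    """1-based suffix array of t, built by sorting the suffixes directly."""
    return sorted(range(1, len(t) + 1), key=lambda i: t[i - 1:])


def inverse_with_sentinel(P):
    """Return R with R[P[i]] = i (1-based) and R[n+1] = 0 (assumed convention)."""
    n = len(P)
    R = [0] * (n + 2)  # index 0 is unused, index n+1 stays 0
    for rank, pos in enumerate(P, start=1):
        R[pos] = rank
    return R


def r_descents(P):
    """Positions i in 1..n-1 with R[P[i]+1] > R[P[i+1]+1]."""
    R = inverse_with_sentinel(P)
    return [i for i in range(1, len(P)) if R[P[i - 1] + 1] > R[P[i] + 1]]


def satisfies_theorem(t, P):
    """Check conditions (a) and (b) for every pair of neighbours in P."""
    R = inverse_with_sentinel(P)
    for i in range(1, len(P)):
        p, q = P[i - 1], P[i]  # P[i] and P[i+1] in the slide's notation
        if t[p - 1] > t[q - 1]:  # violates (a)
            return False
        if t[p - 1] == t[q - 1] and R[p + 1] >= R[q + 1]:  # violates (b)
            return False
    return True


def theorem_matches_sorting(n, alphabet):
    """Exhaustively compare the theorem with direct sorting for length n."""
    for letters in product(alphabet, repeat=n):
        t = "".join(letters)
        sa = naive_suffix_array(t)
        for perm in permutations(range(1, n + 1)):
            if satisfies_theorem(t, list(perm)) != (list(perm) == sa):
                return False
    return True


t = "TCTTCTCTTCTC"
P = naive_suffix_array(t)
print(satisfies_theorem(t, P))  # True
print(r_descents(P))            # [5]
```

For the example text, the single position returned by `r_descents` is the boundary between the suffixes starting with C and those starting with T.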

5
1. Classify Strings for Suffix Array
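a) $t_{P[i]} \le t_{P[i+1]}$ and b) $R[P[i]+1] > R[P[i+1]+1] \Rightarrow t_{P[i]} < t_{P[i+1]}$
[Figure: text to be indexed, with an R-descent marked, i.e. a position i where $R[P[i]+1] > R[P[i+1]+1]$]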
6
1. Classify Strings for Suffix Array
Equivalences between strings
[Figure: text to be indexed, with example strings labelled order-equivalent and order-distinct]
7
2. Counting Strings for Suffix Array
[Figure: text to be indexed, its base string, and non-decreasing sequences]
8
2. Counting Strings for Suffix Array
  • Suffix array P of length n with d R-descents.
  • Number of strings over an alphabet of size a for P:
    $\binom{n+a-d-1}{a-d-1}$
  • This equals the number of non-decreasing sequences of length n
    over a-d elements
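
As a sanity check, the count of non-decreasing sequences of length n over a-d elements, which is the binomial coefficient C(n+a-d-1, a-d-1), can be compared against brute-force enumeration for small n. This is a sketch under the assumption R[n+1] = 0; the helper names and the lowercase test alphabet are illustrative only.

```python
from itertools import permutations, product
from math import comb


def naive_suffix_array(t):
    """1-based suffix array of t, built by sorting the suffixes directly."""
    return sorted(range(1, len(t) + 1), key=lambda i: t[i - 1:])


def count_r_descents(P):
    """Number of positions i with R[P[i]+1] > R[P[i+1]+1], assuming R[n+1] = 0."""
    n = len(P)
    R = [0] * (n + 2)
    for rank, pos in enumerate(P, start=1):
        R[pos] = rank
    return sum(1 for i in range(1, n) if R[P[i - 1] + 1] > R[P[i] + 1])


def count_strings_bruteforce(P, a):
    """Count strings over an a-letter alphabet whose suffix array is P."""
    alphabet = "abcdefghijklmnopqrstuvwxyz"[:a]
    return sum(
        1
        for letters in product(alphabet, repeat=len(P))
        if naive_suffix_array("".join(letters)) == list(P)
    )


def count_strings_formula(P, a):
    """Non-decreasing sequences of length n over a - d elements."""
    n, d = len(P), count_r_descents(P)
    if a <= d:  # fewer than d + 1 characters: no string has this suffix array
        return 0
    return comb(n + a - d - 1, a - d - 1)


# Columns: P, d, brute-force count, formula (alphabet size 3).
for P in permutations(range(1, 4)):
    print(P, count_r_descents(P),
          count_strings_bruteforce(P, 3), count_strings_formula(P, 3))
```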

9
2. Counting Strings for Suffix Array
  • Suffix array P of length n with d R-descents.
  • Number of strings composed of exactly k distinct
    characters for P is

10
2. Counting Strings for Suffix Array
  • Number of strings over alphabet size 20 for
    suffix arrays of length n with 10 R-descents

11
2. Counting Strings for Suffix Array
  • Suffix array P of length n with d R-descents
  • Number of order-distinct strings over alphabet of
    size a is
  • Number of order-distinct strings where all k
    distinct characters must appear is

12
3. Counting Suffix Arrays
  • Definition
  • Let P be a permutation of 1,..., n.
  • Position $i \in \{1,\dots,n-1\}$ is a permutation descent
  • if $P[i] > P[i+1]$.
  • Definition
  • The Eulerian number $\left\langle {n \atop d} \right\rangle$ gives the number of
  • permutations of 1,...,n with exactly d
  • permutation descents.
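
For small n, the two definitions can be illustrated by brute force. This is a minimal sketch with illustrative function names; it enumerates all n! permutations and is only practical for small n.

```python
from collections import Counter
from itertools import permutations


def permutation_descents(P):
    """Number of positions i with P[i] > P[i+1]."""
    return sum(1 for i in range(len(P) - 1) if P[i] > P[i + 1])


def eulerian_row_bruteforce(n):
    """Entry d is the number of permutations of 1..n with exactly d descents."""
    counts = Counter(permutation_descents(p) for p in permutations(range(1, n + 1)))
    return [counts[d] for d in range(n)]


print(eulerian_row_bruteforce(3))  # [1, 4, 1]
print(eulerian_row_bruteforce(4))  # [1, 11, 11, 1]
```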

13
3. Counting Suffix Arrays
  • Well-known fact
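  • Recursive enumeration of Eulerian numbers:
  • $\left\langle {n \atop 0} \right\rangle = 1$, $\left\langle {n \atop d} \right\rangle = 0$
  • for $n \le d$, and
  • $\left\langle {n \atop d} \right\rangle = (d+1) \left\langle {n-1 \atop d} \right\rangle + (n-d) \left\langle {n-1 \atop d-1} \right\rangle$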

14
3. Counting Suffix Arrays
  • Definition
  • Let A(n,d) be the number of permutations of
    length n with d R-descents.
  • Observation
  • A(n,0) = 1
  • A(n,d) = 0 for n ≤ d
  • see next slides

15
3. Counting Suffix Arrays
Text to be indexed
(d+1) possible positions without an additional
R-descent
16
3. Counting Suffix Arrays
Text to be indexed
(d+1) possible positions without an additional
R-descent
17
3. Counting Suffix Arrays
  • Together
  • A(n,0) = 1,
  • A(n,d) = 0 for n ≤ d, and
  • $A(n,d) = (d+1)\,A(n-1,d) + (n-d)\,A(n-1,d-1)$
  • Theorem
  • The number A(n,d) of permutations of length n
  • with d R-descents is the Eulerian number
    $\left\langle {n \atop d} \right\rangle$.
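
The recurrence and the theorem can be cross-checked for small n. The sketch below implements A(n,d) as a memoized recursion and compares it with a brute-force count of R-descents over all permutations; it assumes R[n+1] = 0, and the function names are illustrative.

```python
from collections import Counter
from functools import lru_cache
from itertools import permutations


@lru_cache(maxsize=None)
def A(n, d):
    """A(n,d) from the recurrence on the slide, for n >= 1."""
    if d < 0:
        return 0
    if d == 0:
        return 1
    if n <= d:
        return 0
    return (d + 1) * A(n - 1, d) + (n - d) * A(n - 1, d - 1)


def count_r_descents(P):
    """Number of positions i with R[P[i]+1] > R[P[i+1]+1], assuming R[n+1] = 0."""
    n = len(P)
    R = [0] * (n + 2)
    for rank, pos in enumerate(P, start=1):
        R[pos] = rank
    return sum(1 for i in range(1, n) if R[P[i - 1] + 1] > R[P[i] + 1])


def r_descent_distribution(n):
    """Entry d is the number of permutations of 1..n with exactly d R-descents."""
    counts = Counter(count_r_descents(p) for p in permutations(range(1, n + 1)))
    return [counts[d] for d in range(n)]


print([A(4, d) for d in range(4)])  # [1, 11, 11, 1]
print(r_descent_distribution(4))    # [1, 11, 11, 1]
```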

18
3. Counting Suffix Arrays
  • The number of distinct suffix arrays of length n
    for strings over an alphabet of size a is
    $\sum_{d=0}^{a-1} A(n,d)$
  • Lower bound for compressibility of suffix arrays
    in the Kolmogorov sense
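
A suffix array with d R-descents needs at least d+1 distinct characters, so the count over an alphabet of size a is the sum of A(n,d) for d from 0 to a-1. The sketch below checks this against brute-force enumeration and prints the base-2 logarithm of the count, read here as the number of bits that any encoding distinguishing all these suffix arrays needs in the worst case. Function names are illustrative, and the values n = 12, a = 4 are only an example.

```python
from functools import lru_cache
from itertools import product
from math import factorial, log2


@lru_cache(maxsize=None)
def A(n, d):
    """A(n,d) from the recurrence on the previous slide, for n >= 1."""
    if d < 0:
        return 0
    if d == 0:
        return 1
    if n <= d:
        return 0
    return (d + 1) * A(n - 1, d) + (n - d) * A(n - 1, d - 1)


def count_suffix_arrays(n, a):
    """Sum of A(n,d) over d = 0..a-1 (terms with d >= n are zero)."""
    return sum(A(n, d) for d in range(min(a, n)))


def naive_suffix_array(t):
    """1-based suffix array of t as a tuple, built by direct sorting."""
    return tuple(sorted(range(1, len(t) + 1), key=lambda i: t[i - 1:]))


def count_suffix_arrays_bruteforce(n, a):
    """Count distinct suffix arrays over all a**n strings (small n only)."""
    alphabet = "abcdefghijklmnopqrstuvwxyz"[:a]
    return len({naive_suffix_array("".join(s)) for s in product(alphabet, repeat=n)})


n, a = 12, 4
print(log2(count_suffix_arrays(n, a)))  # bits needed over a 4-letter alphabet
print(log2(factorial(n)))               # bits needed for an arbitrary permutation
```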

19
3. Counting Suffix Arrays
  • Number of distinct suffix arrays of length n for
    strings over alphabet of size 20

20
3. Counting Suffix Arrays
  • Number of distinct suffix arrays of length n for
    strings over alphabet of size 4

21
4. Summation Identities
  • Worpitzky's identity by summing up the number of
    strings of length n for each suffix array
  • Summation rule for Eulerian numbers to generate
    the Stirling numbers of the second kind
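
Both identities can be checked numerically. The formulas did not survive in the transcript, so the sketch below assumes the following forms: a^n as the sum over d of A(n,d) C(n+a-d-1, n), and k! S(n,k) as the sum over d of A(n,d) C(n-d-1, k-d-1), where S(n,k) is the Stirling number of the second kind. Function names are illustrative.

```python
from functools import lru_cache
from math import comb, factorial


@lru_cache(maxsize=None)
def A(n, d):
    """Eulerian number via the recurrence for A(n,d), for n >= 1."""
    if d < 0:
        return 0
    if d == 0:
        return 1
    if n <= d:
        return 0
    return (d + 1) * A(n - 1, d) + (n - d) * A(n - 1, d - 1)


@lru_cache(maxsize=None)
def stirling2(n, k):
    """Stirling number of the second kind via S(n,k) = k S(n-1,k) + S(n-1,k-1)."""
    if n == 0 and k == 0:
        return 1
    if n == 0 or k == 0:
        return 0
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)


def worpitzky_rhs(n, a):
    """Sum over all suffix arrays of the number of strings over a letters."""
    return sum(A(n, d) * comb(n + a - d - 1, n) for d in range(n))


def summation_rhs(n, k):
    """Same sum, restricted to strings in which all k characters appear."""
    return sum(A(n, d) * comb(n - d - 1, k - d - 1) for d in range(min(k, n)))


print(worpitzky_rhs(5, 3), 3 ** 5)
print(summation_rhs(5, 3), factorial(3) * stirling2(5, 3))
```

In both cases the left-hand side counts strings of length n directly, while the right-hand side groups the same strings by their suffix array.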

22
Summary
  • Constructive proofs to count strings sharing the
    same suffix array
  • Constructive proof to count distinct suffix
    arrays yielding lower bound for suffix array
    compression
  • Constructive proofs for Worpitzky's identity and
    the summation rule of Eulerian numbers to count
    Stirling numbers of the second kind

23
Outlook
  • Efficient enumeration algorithm for suffix arrays
  • Compressed suffix arrays for fast querying in
    bioinformatics applications
  • Average case analysis under non-uniform model

24
  • Thank you for your attention!