Advanced Algorithms - PowerPoint PPT Presentation

Learn more at: http://www.cs.fsu.edu

1
Advanced Algorithms
  • Piyush Kumar
  • (Lecture 12: String Matching/Searching)

Welcome to COT5405
Source: S. Šaltenis
2
Computing Edit Distance
  • Two text strings are given: X and Y
  • We want to quantify how similar they are
  • Comparing DNA sequences in studies of evolution
    of different species
  • Spell checkers
  • One of the measures of similarity is the edit
    distance between X and Y
  • (small distance <--> high similarity)

3
Edit Distance Definition
  • We want to convert X into Y using three kinds
    of operations:
  • Delete a letter, insert a letter, or substitute
    one letter for another.
  • E.g. X = ACGGTTA can be converted to Y = CGTAT by
    deleting the 1st A and the 2nd G, and swapping
    A <--> T in the last two positions (two
    substitutions).
  • ACGGTTA
  • _CG_TAT

4
Edit Distance Definition
  • We want to convert X into Y using three kinds
    of operations:
  • Delete a letter, insert a letter, or substitute
    one letter for another.
  • The minimum number of these operations needed to
    convert X into Y is called the edit distance
    between X and Y.

5
Edit Distance Optimal Substructure
  • Denote by E(i,j) the edit distance between the
    i-th prefix of X (x_1 x_2 ... x_i) and the j-th
    prefix of Y (y_1 y_2 ... y_j)
  • If x_i = y_j, then E(i,j) = E(i-1,j-1)
  • If x_i ≠ y_j,
  • Either substitute x_i -> y_j (cost is 1 + E(i-1,j-1))
  • or delete x_i (cost is 1 + E(i-1,j))
  • or insert y_j (cost is 1 + E(i,j-1))
  • Decide which operation to use by comparing the
    three values and taking the minimum.
  • Cut-and-paste argument

6
Edit Distance Computing
  • Let n be the length of the word X, and let m be
    the length of Y.
  • To compute E(i,j) (the edit distance of (X_i, Y_j))
    we construct a 2-dim array (in Scheme: a vector
    of vectors) of size (n+1) x (m+1).
  • We initialize the array at the leftmost column
    and topmost row: E(i,0) = i, E(0,j) = j (the edit
    distance to an empty word).

7
  • To fill entry E(i,j), we need the three former
    values E(i-1,j-1), E(i,j-1), E(i-1,j). Having these,
    we use the recurrence we saw to fill E(i,j).
  • Desired value is E(n,m).
  • Observe: conditions in the problem restrict the
    sub-problems (What is the total number of
    sub-problems?)

8
X = ACGGTTA, Y = CGTAT
9
Edit Distance Example
  • Let's do a dry run with X = ACGGTTA, Y = CGTAT
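
The dry run can be reproduced with a short sketch of the table-filling scheme from the previous slides. This is a minimal illustration in Python rather than Scheme; the function name edit_distance and the list-of-lists table are assumptions made for the example, not part of the lecture.

```python
def edit_distance(x: str, y: str) -> int:
    """Fill the (n+1) x (m+1) table E and return E(n, m)."""
    n, m = len(x), len(y)
    # E[i][j] = edit distance between the first i letters of x
    # and the first j letters of y.
    E = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        E[i][0] = i  # leftmost column: delete all i letters
    for j in range(m + 1):
        E[0][j] = j  # topmost row: insert all j letters
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1] == y[j - 1]:
                # Matching letters cost nothing.
                E[i][j] = E[i - 1][j - 1]
            else:
                E[i][j] = 1 + min(
                    E[i - 1][j - 1],  # substitute x_i -> y_j
                    E[i - 1][j],      # delete x_i
                    E[i][j - 1],      # insert y_j
                )
    return E[n][m]


print(edit_distance("ACGGTTA", "CGTAT"))  # prints 4
```

The table has (n+1)(m+1) entries and each is filled in constant time, so the sketch runs in O(nm) time and space. The result 4 matches the four operations used in the earlier example (two deletions, two substitutions).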

10
Short Introduction to Search Engines
11
Applications
  • ?

12
Typical Web Search Engine Architecture
[Figure: crawl the web; check for duplicates, store the
documents (DocIds); create an inverted index; search
engine servers take the user query, use the inverted
index, and show results to the user]
Courtesy R. Ramakrishnan
13
Goals
  • Speed
  • Space Efficiency
  • Accuracy: the first item should be what I want
    to see?
  • Updates: periodic? Dynamic?

14
Typical Methods
  • Full Text scanning (egrep?)
  • Inverted File Indexing (Most common)
  • Signature Files
  • Vector Space Model

15
Types of queries
  • Boolean
  • Proximity? (Edit Distance?)
  • In relation to other documents.
  • FileType Keywords

Allow for prefix matches? Wildcards? Edit
distance bounds? (egrep)
16
Common Tricks
  • Case unfolding: Tallahassee = tallahassee.
  • Stemming: compress = compressed = compression
  • (off-the-shelf stemmers are available for
    English)
  • Ignore words: a, the, it, be, ...
  • Thesaurus: fast = rapid
  • (typically use available clustering)
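
A minimal sketch of how these tricks might be chained when tokens are extracted. Everything here is illustrative: toy_stem is a hypothetical three-suffix stripper standing in for an off-the-shelf stemmer, and STOPWORDS and THESAURUS contain only the examples given on the slide.

```python
STOPWORDS = {"a", "the", "it", "be"}  # only the words listed on the slide
THESAURUS = {"rapid": "fast"}         # maps a synonym to one chosen form


def toy_stem(word: str) -> str:
    """Hypothetical stemmer: strips a few suffixes.

    A real system would use an off-the-shelf stemmer instead.
    """
    for suffix in ("ion", "ed", "ing"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def normalize(text: str) -> list[str]:
    """Lowercase, drop stopwords, stem, then apply the thesaurus."""
    tokens = []
    for raw in text.split():
        word = raw.strip(".,;:!?").lower()  # ignore case and punctuation
        if not word or word in STOPWORDS:
            continue
        word = toy_stem(word)
        tokens.append(THESAURUS.get(word, word))
    return tokens


print(normalize("Compress compressed compression"))
print(normalize("The rapid Tallahassee"))
```

With these assumptions the three forms of "compress" collapse to the same token, and "rapid" is indexed as "fast".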

17
Inverted File Index
  • Periodically rebuilt, static otherwise.
  • Documents are parsed to extract tokens. These are
    saved with the Document ID.

Doc 1: Now is the time for all good men to come to the
aid of their country
Doc 2: It was a dark and stormy night in the country
manor. The time was past midnight
18
How Inverted Files are Created
  • After all documents have been parsed the inverted
    file is sorted alphabetically.

19
How Inverted Files are Created
  • Multiple term entries for a single document are
    merged.
  • Within-document term frequency information is
    compiled.

20
How Inverted Files are Created
  • Finally, the file can be split into
    a Dictionary or Lexicon file and
    a Postings file

21
How Inverted Files are Created
  • [Figure: the Dictionary/Lexicon file and the
    Postings file]
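
The construction steps on the preceding slides (parse, sort, merge with term frequencies, split) can be sketched on the two example documents. The function name build_index and the exact layout of the two output structures are assumptions for illustration; the slides' original tables are not recoverable from this transcript.

```python
from collections import Counter

docs = {
    1: "Now is the time for all good men to come to the aid of their country",
    2: "It was a dark and stormy night in the country manor. "
       "The time was past midnight",
}


def build_index(docs):
    # Step 1: parse each document into (term, doc_id) pairs.
    pairs = [
        (word.strip(".").lower(), doc_id)
        for doc_id, text in docs.items()
        for word in text.split()
    ]
    # Step 2: sort alphabetically by term (ties broken by doc_id).
    pairs.sort()
    # Step 3: merge repeated entries for the same document,
    # keeping the within-document term frequency.
    freq = Counter(pairs)
    # Step 4: split into a postings structure (term -> [(doc_id, tf)])
    # and a dictionary (term -> number of documents containing it).
    postings = {}
    for (term, doc_id), tf in sorted(freq.items()):
        postings.setdefault(term, []).append((doc_id, tf))
    dictionary = {term: len(plist) for term, plist in postings.items()}
    return dictionary, postings


dictionary, postings = build_index(docs)
print(postings["country"])  # [(1, 1), (2, 1)]
print(postings["to"])       # [(1, 2)]
```

In a real system the dictionary would hold a pointer into a postings file on disk rather than sharing keys with an in-memory dict.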

22
Why use Inverted Files?
  • Permits fast search for individual terms
  • For Boolean queries.
  • For statistical ranking algorithms.

23
Issues with Inverted files?
  • How to minimize the space taken by the postings
    list?
  • Access to the lexicon?
  • How to do union and intersection of postings.

24
Minimizing Space
  • Store postings with deltas
  • Original posting list: 3, 5, 20, 21, 23
  • Delta encoding: 3, 2, 15, 1, 2
  • Use compression on the delta encoding
    (Huffman, arithmetic coding)
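
A small sketch of the delta step only, using the posting list from the slide. The function names are assumptions, and the follow-up compression stage (Huffman or arithmetic coding) is not shown.

```python
def delta_encode(postings):
    """Replace each doc ID by its gap from the previous one."""
    prev, deltas = 0, []
    for doc_id in postings:
        deltas.append(doc_id - prev)
        prev = doc_id
    return deltas


def delta_decode(deltas):
    """Recover the original doc IDs with a running sum."""
    total, postings = 0, []
    for gap in deltas:
        total += gap
        postings.append(total)
    return postings


print(delta_encode([3, 5, 20, 21, 23]))  # [3, 2, 15, 1, 2]
```

The gaps are small for frequent terms, which is what makes the subsequent entropy coding effective.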

25
Access to Lexicon?
  • Static: sorted arrays, perfect hashing
  • Dynamic: tries, B-trees

Prefix Matching?
26
Tries
Useful for reTRIEval. First appearance: 1959. Radix
search?
Courtesy Tamassia, Goodrich.
27
Preprocessing Strings
  • Preprocessing the pattern speeds up pattern
    matching queries
  • After preprocessing the pattern, KMP's algorithm
    performs pattern matching in time proportional to
    the text size
  • If the text is large, immutable and searched for
    often (e.g., works by Shakespeare), we may want
    to preprocess the text instead of the pattern
  • A trie is a compact data structure for
    representing a set of strings, such as all the
    words in a text
  • A trie supports pattern matching queries in time
    proportional to the pattern size

28
Standard Tries (§11.3.1)
  • The standard trie for a set of strings S is an
    ordered tree such that
  • Each node but the root is labeled with a
    character
  • The children of a node are alphabetically ordered
  • The paths from the root to the external nodes
    yield the strings of S
  • Example: standard trie for the set of strings
    S = {bear, bell, bid, bull, buy, sell, stock,
    stop}

29
Analysis of Standard Tries
  • A standard trie uses O(n) space and supports
    searches, insertions and deletions in time O(dm),
    where
  • n = total size of the strings in S
  • m = size of the string parameter of the operation
  • d = size of the alphabet
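
A minimal sketch of a standard trie with insertion and search, built on the example set S. The class and method names are assumptions. Note one deviation from the analysis above: children are kept in a Python dict, so each step is an expected O(1) lookup and an operation takes expected O(m) time; the O(dm) bound corresponds to scanning up to d children at each node.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # character -> TrieNode
        self.is_word = False  # True if a string of S ends here


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            if ch not in node.children:
                node.children[ch] = TrieNode()
            node = node.children[ch]
        node.is_word = True

    def _walk(self, s):
        """Follow s from the root; return the node reached, or None."""
        node = self.root
        for ch in s:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def contains(self, word):
        node = self._walk(word)
        return node is not None and node.is_word

    def has_prefix(self, prefix):
        return self._walk(prefix) is not None


trie = Trie()
for w in ["bear", "bell", "bid", "bull", "buy", "sell", "stock", "stop"]:
    trie.insert(w)

print(trie.contains("bell"))   # True
print(trie.contains("be"))     # False: "be" is only a prefix
print(trie.has_prefix("be"))   # True
```

Deletion is omitted to keep the sketch short.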

30
Applications of Tries
  • A standard trie supports the following operations
    on a pre-processed text in time O(m), where m is
    the size of word X
  • Word matching: find the first occurrence of the
    word X in the text.
  • Prefix matching: find the first occurrence of the
    longest prefix of word X in the text.

31
Word Matching with a Trie
  • We insert the words of the text into a trie
  • Each leaf stores the occurrences of the
    associated word in the text

32
Compressed Tries
First appearance: 1968
  • Solves the following problem in the standard
    trie:
  • Creation of extra nodes in the trie
    (path compression)
  • Just a different representation of the standard
    trie.

33
Compressed Tries
  • A compressed trie has internal nodes of degree at
    least two
  • It is obtained from standard trie by compressing
    chains of redundant nodes

34
Compact Representation
  • Compact representation of a compressed trie for
    an array of strings
  • Stores at the nodes ranges of indices instead of
    substrings
  • Uses O(s) space, where s is the number of strings
    in the array
  • Serves as an auxiliary index structure

35
Suffix Trie (§11.3.3)
  • The suffix trie of a string X is the compressed
    trie of all the suffixes of X

36
Analysis of Suffix Tries
  • Compact representation of the suffix trie for a
    string X of size n from an alphabet of size d
  • Uses O(n) space
  • Supports arbitrary pattern matching queries in X
    in O(dm) time, where m is the size of the pattern
  • Can be constructed in O(n) time
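
To illustrate the query side only, here is a deliberately naive sketch: an uncompressed suffix trie built by inserting every suffix. It takes O(n^2) time and space, so it is not the O(n) compact structure analysed above, and it says nothing about linear-time construction. The function names and the nested-dict representation are assumptions for the example.

```python
def build_suffix_trie(text):
    """Insert every suffix of text into a nested-dict trie (quadratic)."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
    return root


def occurs(trie, pattern):
    """A pattern occurs in the text iff it is a prefix of some suffix."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


trie = build_suffix_trie("minimize")
print(occurs(trie, "nimi"))  # True
print(occurs(trie, "zim"))   # False
```

The query walks one node per pattern character, which is where the dependence on m (and not on n) comes from.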

37
Tries and Web Search Engines
  • The index of a search engine (collection of all
    searchable words) is stored in a compressed trie.
  • Each leaf of the trie is associated with a word
    and a list of pages (URLs) containing that word
    (called the occurrence list).
  • The trie is kept in internal memory.
  • The occurrence lists are kept in external memory
    and are ranked by relevance.

38
Tries and Web Search Engines
  • Boolean queries for sets of words (e.g. Java and
    coffee) correspond to sets of operations (e.g.
    intersection) on the occurrence lists.
  • Additional information retrieval techniques are
    used, such as:
  • Stopword elimination (as done in the standard
    tries example).
  • Stemming (e.g. identify add, adding and
    added as the same word).
  • Link analysis (recognise authoritative pages).
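
A Boolean query such as "Java and coffee" reduces to a set operation on two occurrence lists. The sketch below assumes each list is a sorted list of page IDs; the IDs shown are made up for illustration, and the function names are not from the lecture.

```python
def intersect(a, b):
    """AND query: merge two sorted lists, keeping common IDs."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def union(a, b):
    """OR query: merge two sorted lists, dropping duplicates."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])  # at most one of these tails is non-empty
    out.extend(b[j:])
    return out


java = [1, 4, 7, 9]      # hypothetical occurrence list for "java"
coffee = [2, 4, 9, 12]   # hypothetical occurrence list for "coffee"
print(intersect(java, coffee))  # [4, 9]
print(union(java, coffee))      # [1, 2, 4, 7, 9, 12]
```

Both operations make a single pass over the two lists, so they run in time linear in the total list length.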