Chapter 7:


Chapter 7 Selected Algorithms
7.1 External Search
7.2 External Sorting
7.3 Text searching

2
7.2 External Sorting

Problem: sorting a large amount of data which, as in
external searching, is stored in blocks (pages).
Efficiency: the number of page accesses should
be kept low!
Strategy: use a sorting algorithm that processes the
data sequentially (no frequent page exchanges):
MergeSort!

3
General form of MergeSort

mergesort(S): returns the set S sorted
  if (S is empty or has only 1 element)
    return(S)
  else
    divide S into two halves A and B
    A' = mergesort(A)
    B' = mergesort(B)
    return(merge(A', B'))
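
The recursive scheme above can be sketched in Java as follows. This is a minimal in-memory illustration, not the external variant discussed later; the class name `MergeSortSketch` and the use of `List<Integer>` are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

final class MergeSortSketch {
    // Returns a new sorted list; the input list is not modified.
    static List<Integer> mergesort(List<Integer> s) {
        if (s.size() <= 1) {
            return new ArrayList<>(s);
        }
        int mid = s.size() / 2;
        List<Integer> a = mergesort(s.subList(0, mid));        // first half
        List<Integer> b = mergesort(s.subList(mid, s.size())); // second half
        return merge(a, b);
    }

    // Merges two sorted lists into one sorted list.
    static List<Integer> merge(List<Integer> a, List<Integer> b) {
        List<Integer> out = new ArrayList<>(a.size() + b.size());
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            out.add(a.get(i) <= b.get(j) ? a.get(i++) : b.get(j++));
        }
        while (i < a.size()) out.add(a.get(i++));
        while (j < b.size()) out.add(b.get(j++));
        return out;
    }
}
```

The `merge` step reads both inputs strictly from front to back, which is the property that makes the method attractive for data stored in pages.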

Start: n data items in a file g1,
divided into pages of size b:
Page 1: s_1, ..., s_b
Page 2: s_{b+1}, ..., s_{2b}
...
Page k: s_{(k-1)b+1}, ..., s_n
(k = ⌈n/b⌉)
When processed sequentially, only k page accesses are needed
instead of n.

5
Variation of MergeSort for external sorting

MergeSort is a divide-and-conquer algorithm;
for external sorting it is used without the divide step,
only the merge step.
Definition: a run is an ordered subsequence within a
file.
Strategy: merge runs into increasingly longer
runs until everything is sorted.

6
Algorithm

Step 1: From the sequence in the input file g1,
generate starting runs and distribute them to two
files f1 and f2,
with the same number of runs (±1) in each
(there are several strategies for this, discussed later).
From now on use four files: f1, f2, g1, g2.

Step 2 (main step):
While the number of runs is > 1, repeat:
  Merge pairs of runs, one from f1 and one from f2, into runs of
  double length, writing them alternately to g1 and g2, until there
  are no more runs in f1 and f2.
  Merge pairs of runs, one from g1 and one from g2, into runs of
  double length, writing them alternately to f1 and f2, until there
  are no more runs in g1 and g2.
Each loop iteration consists of two phases.
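
As a rough in-memory sketch of the two steps above, each "file" can be modelled as a list of runs. The class `BalancedMergeSketch` and its method names are assumptions for illustration; a real implementation would read and write pages on disk instead of Java lists, and here the loop performs one phase per iteration and swaps the roles of the input and output files.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class BalancedMergeSketch {
    static List<Integer> sort(List<Integer> input) {
        // Step 1: starting runs of length 1, distributed alternately to two files.
        List<List<Integer>> in1 = new ArrayList<>();
        List<List<Integer>> in2 = new ArrayList<>();
        for (int i = 0; i < input.size(); i++) {
            (i % 2 == 0 ? in1 : in2).add(Collections.singletonList(input.get(i)));
        }
        // Step 2: one phase per iteration, until a single run is left.
        while (in1.size() + in2.size() > 1) {
            List<List<Integer>> out1 = new ArrayList<>();
            List<List<Integer>> out2 = new ArrayList<>();
            for (int r = 0; r < in1.size(); r++) {
                List<Integer> a = in1.get(r);
                // The second file may hold one run fewer than the first.
                List<Integer> b = r < in2.size() ? in2.get(r) : Collections.<Integer>emptyList();
                (r % 2 == 0 ? out1 : out2).add(merge(a, b));
            }
            in1 = out1;  // output files become the input files of the next phase
            in2 = out2;
        }
        return in1.isEmpty() ? Collections.<Integer>emptyList() : in1.get(0);
    }

    // Sequential merge of two sorted runs.
    static List<Integer> merge(List<Integer> a, List<Integer> b) {
        List<Integer> out = new ArrayList<>(a.size() + b.size());
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            out.add(a.get(i) <= b.get(j) ? a.get(i++) : b.get(j++));
        }
        while (i < a.size()) out.add(a.get(i++));
        while (j < b.size()) out.add(b.get(j++));
        return out;
    }
}
```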

Example
Start:
g1: 64, 17, 3, 99, 79, 78, 19, 13, 67, 34, 8, 12, 50
Step 1 (length of each starting run: 1; runs are separated by "|"):
f1: 64 | 3 | 79 | 19 | 67 | 8 | 50
f2: 17 | 99 | 78 | 13 | 34 | 12
Main step,
1st loop, part 1 (1st phase):
g1: 17, 64 | 78, 79 | 34, 67 | 50
g2: 3, 99 | 13, 19 | 8, 12
1st loop, part 2 (2nd phase):
f1: 3, 17, 64, 99 | 8, 12, 34, 67
f2: 13, 19, 78, 79 | 50

9
Example (continued; runs are separated by "|")

1st loop, part 2 (2nd phase):
f1: 3, 17, 64, 99 | 8, 12, 34, 67
f2: 13, 19, 78, 79 | 50
2nd loop, part 1 (3rd phase):
g1: 3, 13, 17, 19, 64, 78, 79, 99
g2: 8, 12, 34, 50, 67
2nd loop, part 2 (4th phase):
f1: 3, 8, 12, 13, 17, 19, 34, 50, 64, 67, 78, 79, 99
f2: (empty)

10
Implementation

For each of the files f1, f2, g1, g2, at least one page
is kept in main memory (RAM); even better,
a second page per file can be kept as a buffer.
Read/write operations are performed page-wise.

11
Costs

Page accesses during step 1 and during each phase:
O(n/b).
Each phase halves the number of runs,
thus:
Total number of page accesses: O((n/b) log n),
when starting with runs of length 1.
Internal computing time in step 1 and in each phase
is O(n).
Total internal computing time: O(n log n).

12
Two variants of the first step: creation of the
starting runs

A) Direct mixing
Sort as many data items as possible in primary memory
(internally), for example m data records.
→ Starting runs of (fixed!) length m,
thus r = n/m starting runs.
The total number of page accesses is then
O((n/b) log(r)).
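
A minimal sketch of this variant, assuming the input is available as a Java list and that m items fit in memory at once. The class name `DirectMixingSketch` and the method `startingRuns` are hypothetical names chosen for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class DirectMixingSketch {
    // Cuts the input into chunks of m items and sorts each chunk in memory.
    // Every run has length m, except possibly the last one.
    static List<List<Integer>> startingRuns(List<Integer> input, int m) {
        if (m < 1) throw new IllegalArgumentException("m must be at least 1");
        List<List<Integer>> runs = new ArrayList<>();
        for (int from = 0; from < input.size(); from += m) {
            int to = Math.min(from + m, input.size());
            List<Integer> run = new ArrayList<>(input.subList(from, to));
            Collections.sort(run);  // internal sort of one memory load
            runs.add(run);
        }
        return runs;
    }
}
```

With 13 items and m = 3 this yields 5 starting runs, the last of which holds a single item.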

13
Two variants of the first step: creation of the
starting runs

B) Natural mixing
Creates starting runs of variable length.
Advantage: we can exploit ordered
subsequences that the file may already contain.
Noteworthy: with the replacement-selection method, the
starting runs can be made longer than the number of items
that fit in primary storage!

14
Replacement-Selection

Read m data items from the input file into primary
memory (array).
repeat
  mark all data in the array as "now".
  start a new run.
  while there is a "now"-marked item in the array:
    select the smallest one (smallest key) among all
    "now"-marked data,
    write it to the output file,
    replace it in the array with the next item
    read from the input file (if there is still
    one); mark that item "now" if it is greater than or equal to
    the last item output, else mark it "not
    now".
until there are no more data in the input file and the array is empty.
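
The loop above might be sketched as follows, using two priority queues in place of the marked array: one for the "now" items and one for the "not now" items. The class name `ReplacementSelectionSketch` and the list-based input are assumptions for illustration; the heap-in-one-array layout mentioned in the slides is not reproduced here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

final class ReplacementSelectionSketch {
    // Splits the input into sorted starting runs using a buffer of m items.
    static List<List<Integer>> runs(List<Integer> input, int m) {
        PriorityQueue<Integer> now = new PriorityQueue<>();     // items for the current run
        PriorityQueue<Integer> notNow = new PriorityQueue<>();  // items held back for the next run
        int next = 0;  // index of the next unread input item
        while (next < input.size() && now.size() < m) {
            now.add(input.get(next++));
        }
        List<List<Integer>> result = new ArrayList<>();
        while (!now.isEmpty()) {
            List<Integer> run = new ArrayList<>();
            while (!now.isEmpty()) {
                int smallest = now.poll();
                run.add(smallest);
                if (next < input.size()) {
                    int x = input.get(next++);
                    // x may join the current run only if it does not break the order.
                    if (x >= smallest) now.add(x); else notNow.add(x);
                }
            }
            result.add(run);
            // Start a new run: everything held back becomes "now".
            PriorityQueue<Integer> tmp = now;
            now = notNow;
            notNow = tmp;
        }
        return result;
    }
}
```

Calling `runs` with the input 64, 17, 3, 99, 79, 78, 19, 13, 67, 34, 8, 12, 50 and m = 3 produces three runs of lengths 6, 4 and 3.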

Example: array in primary storage with capacity 3.
The input file contains the following data:
64, 17, 3, 99, 79, 78, 19, 13, 67, 34, 8, 12, 50
Resulting runs:
3, 17, 64, 78, 79, 99 | 13, 19, 34, 67 | 8, 12, 50
Contents of the array after each step ("not now" data written in
parentheses):

 64     17      3
 64     17     99
 64     79     99
 78     79     99
(19)    79     99
(19)   (13)    99
(19)   (13)   (67)
 19     13     67
 19     34     67
 (8)    34     67
 (8)   (12)    67
 (8)   (12)   (50)
  8     12     50
        12     50
               50
16
Implementation

In an array:
At the front: a heap for the "now"-marked data.
At the back: the refilled "not now" data.
Note: all "now" elements go into the run currently being
generated.

Expected length of the starting runs using the
replacement-selection method:
2m
(m = size of the array in primary storage
= number of data items that fit into primary
storage),
assuming uniformly distributed keys.
Even longer if the input is already partially sorted!

18
Multi-way merging

Instead of using two input and two output files
(alternating between f1, f2 and g1, g2),
use k input and k output files, in order to be
able to always merge k runs into one.
In each step, take the smallest item among the
k runs and write it to the current output file.
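
One common way to pick the smallest item among k runs is a priority queue holding the current head of each run. The sketch below assumes the runs are in-memory lists; `KWayMergeSketch` is a hypothetical name, and the slides do not prescribe this particular data structure.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

final class KWayMergeSketch {
    // Merges k sorted runs into one sorted run.
    static List<Integer> merge(List<List<Integer>> runs) {
        // Heap entries are {value, run index, position within that run}.
        PriorityQueue<int[]> heap =
            new PriorityQueue<>((x, y) -> Integer.compare(x[0], y[0]));
        for (int r = 0; r < runs.size(); r++) {
            if (!runs.get(r).isEmpty()) {
                heap.add(new int[] {runs.get(r).get(0), r, 0});
            }
        }
        List<Integer> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] top = heap.poll();  // smallest head among all runs
            out.add(top[0]);
            List<Integer> run = runs.get(top[1]);
            int nextPos = top[2] + 1;
            if (nextPos < run.size()) {
                heap.add(new int[] {run.get(nextPos), top[1], nextPos});
            }
        }
        return out;
    }
}
```

Each output item costs one heap removal and at most one insertion, which is where a log2(k) factor per item comes from.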

Cost:
In each phase the number of runs is divided by k.
Thus, with r starting runs we need only
log_k(r) phases
(instead of log_2(r)).
Total number of page accesses:
O((n/b) log_k(r)).
Internal computing time for each phase: O(n log_2(k)).
Total internal computing time:
O(n log_2(k) log_k(r)) = O(n log_2(r)).
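
To make the phase count concrete, the following small helper counts how many phases are needed to reduce r starting runs to a single run when k runs are merged at a time. The name `MergePhasesSketch.phases` is an assumption for this example; it simply repeats the division by k (rounded up) described above.

```java
final class MergePhasesSketch {
    // Number of merge phases needed to reduce r runs to one run
    // when up to k runs are merged at a time.
    static int phases(long r, int k) {
        if (k < 2) throw new IllegalArgumentException("k must be at least 2");
        int p = 0;
        long runs = r;
        while (runs > 1) {
            runs = (runs + k - 1) / k;  // ceil(runs / k) runs remain after one phase
            p++;
        }
        return p;
    }
}
```

For r = 1000 starting runs this gives 10 phases with k = 2 but only 3 phases with k = 10, so the number of passes over the file drops accordingly.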

20
Chapter 7.3

Text searching according to Boyer and Moore
Position index and matching direction
ShiftRight as static function
Bad Character Heuristic
Good-Suffix Heuristic

21
Text searching

Problem: test whether a pattern (string) s appears in
a text (string) t or not.
The naive approach gives an algorithm taking
O(|s| · |t|).
There are better algorithms:
Knuth, Morris, Pratt (1977)
Boyer and Moore (1977).

22
Naive Algorithm
Operations:
  ohne1(String) → String  (the string without its first character)
  anf1(String) → char     (the first character of the string)

algorithm prefix(s, t: String) → Boolean
  if (s empty) then output true; exit
  if (t empty) then output false; exit
  if anf1(s) = anf1(t)
    then output prefix(ohne1(s), ohne1(t))
    else output false

algorithm SubString(s, t: String) → Boolean
  res := false
  while (t not empty) and (res = false) perform
    if prefix(s, t) then res := true
    else t := ohne1(t)
  output res

Cost: O(|s| · |t|)
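
A possible Java rendering of the naive algorithm is shown below. Instead of repeatedly removing the first character, this sketch moves a start index through the text, which avoids copying strings; the names `NaiveSearchSketch`, `prefixAt` and `subString` are assumptions. One small difference from the pseudocode: an empty pattern is reported as found even in an empty text.

```java
final class NaiveSearchSketch {
    // True if s matches t starting at offset 'from' in t.
    static boolean prefixAt(String s, String t, int from) {
        if (s.length() > t.length() - from) return false;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) != t.charAt(from + i)) return false;
        }
        return true;
    }

    // True if s occurs somewhere in t; tries every start position in turn,
    // so the worst case is |s| * |t| character comparisons.
    static boolean subString(String s, String t) {
        for (int from = 0; from <= t.length() - s.length(); from++) {
            if (prefixAt(s, t, from)) return true;
        }
        return false;
    }
}
```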
23
The Knuth, Morris, Pratt (KMP) Algorithm
The naive algorithm shifts the pattern one
position to the right when a mismatch
happens. The KMP algorithm exploits the
structure of the pattern in order to shift
it to the right as far as possible. How? If
some sub-pattern is repeated within the
pattern, we can use it in the following way.
24
Knuth, Morris, Pratt (2)

Let us assume that, comparing the text with the
pattern at a certain point, we have j characters
that match but the character at position j+1
does not.

[Figure: the text with the pattern aligned below it]
25
Knuth, Morris, Pratt (3)

If the last i characters of the matched part of the
pattern (that is, characters j-i+1 to j, inclusive)
coincide with the first i characters of the pattern, then we
can shift the pattern to the right until this prefix lies under
that suffix, and continue checking from there.

[Figure: the pattern, with a suffix of length i of the matched part and a prefix of length i marked]
26
Knuth, Morris, Pratt (4)

An interesting characteristic of this approach is
that we can calculate beforehand (before starting
the search) how far we can shift the pattern,
because it depends only on the pattern itself (does it
contain repetitions?).
We define the so-called failure function.
Let the pattern consist of the characters
b_1 b_2 ... b_m. Then
f(j) = max{ i < j | b_1 ... b_i = b_{j-i+1} ... b_j }, for j = 1..m.

[Figure: the pattern]
27
The Knuth, Morris, Pratt Algorithm (5)

With this function defined, the
algorithm can be explained as follows. Start comparing
the text with the pattern from left to right at
position k = 0. If at a certain position k the
character of the text does not match the
character of the pattern at position j+1,
then continue comparing the pattern from position f(j)+1,
staying at text position k (because we know that the f(j)
characters before it already match).

[Figure: the text with the pattern aligned below it]
28
The Algorithm (in pseudo-Java), assuming f(i)
already calculated
// n = length of the text
// m = length of the pattern
// indexes start from 1
int k = 0;
int j = 0;
while (k < n && j < m) {
    while (j > 0 && text[k+1] != pattern[j+1]) j = f[j];
    if (text[k+1] == pattern[j+1]) j++;
    k++;
}
// j == m => match found; k == n (with j < m) => failure
29
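Construction of the f(i) function
// m = length of the pattern
// indexes begin with 1
int[] f = new int[m + 1];  // entry 0 unused, so that f[1..m] exist
f[1] = 0;
int j = 1;
int i;
while (j < m) {
    i = f[j];
    while (i > 0 && pattern[i+1] != pattern[j+1]) i = f[i];
    if (pattern[i+1] == pattern[j+1]) f[j+1] = i + 1;
    else f[j+1] = 0;
    j++;
}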
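
The two pseudo-Java fragments (search and construction of f) can be combined into a runnable sketch. Unlike the slides, the code below uses Java's 0-based string indices, so `f[j]` is indexed by the number of matched characters; the class name `KmpSketch` and the return convention (index of the first occurrence, or -1) are assumptions for this example.

```java
final class KmpSketch {
    // f[j] = length of the longest proper prefix of the first j pattern
    // characters that is also a suffix of them (f[0] = f[1] = 0).
    static int[] failure(String pattern) {
        int m = pattern.length();
        int[] f = new int[m + 1];
        int i = 0;
        for (int j = 1; j < m; j++) {
            while (i > 0 && pattern.charAt(i) != pattern.charAt(j)) i = f[i];
            if (pattern.charAt(i) == pattern.charAt(j)) i++;
            f[j + 1] = i;
        }
        return f;
    }

    // Returns the 0-based index of the first occurrence of pattern in text, or -1.
    static int indexOf(String text, String pattern) {
        int n = text.length(), m = pattern.length();
        if (m == 0) return 0;
        int[] f = failure(pattern);
        int j = 0;  // number of pattern characters currently matched
        for (int k = 0; k < n; k++) {
            // On a mismatch fall back in the pattern; the text index k never moves back.
            while (j > 0 && text.charAt(k) != pattern.charAt(j)) j = f[j];
            if (text.charAt(k) == pattern.charAt(j)) j++;
            if (j == m) return k - m + 1;
        }
        return -1;
    }
}
```

Because k only ever advances, the search takes time proportional to the text length plus the pattern length, rather than their product.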
30
The Boyer and Moore Algorithm

Ideas:
Shift the word s gradually from left to
right, but
compare word s with text t from right to left
within s.
Two heuristics for shifting the search string
s:
Bad-character heuristic
Good-suffix heuristic
Cost: also O(|t| · |s|).

31
Heuristics
32
Notes on the figure
In a) the search string "reminiscence" is compared
with the text from right to left. The
suffix "ce" matches, but the
"bad character" "i" in the text no longer matches the
corresponding "n" of the search string.
In b) the search string is shifted to the right according to the
bad-character heuristic until
the bad character "i" lines up with the
rightmost occurrence of
"i" in the search string. In c),
according to the good-suffix heuristic, the matched
"good suffix" "ce" is compared with the search string.
If this suffix occurs another time
in the search string, the search string can be
shifted far enough that this other
occurrence lines up with the text.
33
The "Bad-Character Heuristic"
Mismatch at position j with s_j ≠ t_{pos+j},
1 ≤ j ≤ d (d is the length of the search string; pos is the
position just before the current start of the search string).
1) The mismatching character t_{pos+j} does not occur in the
search string. Then we can shift the search string by j
positions without missing a match.
2) The mismatching character t_{pos+j} occurs in the
search string. Let k be the
largest index with 1 ≤ k ≤ d at which s_k = t_{pos+j}
holds. If k < j, we shift the
search string by j - k. We then
have at least one match, namely in the
character s_k = t_{pos+j}. The value
k can be computed in advance for every distinct character
of the search string as a function b(a),
where a is from the allowed alphabet.
b(a) gives the position of the
rightmost occurrence of the character a in the
search string. The shift to make is therefore
j - k = j - b(t_{pos+j}).
3) If, however, k > j, the heuristic yields
a negative shift j - k, which is ignored;
the shift is then 1.
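
The three cases can be folded into one formula if b(a) is taken to be 0 for characters that do not occur in the search string; that convention is an assumption of this sketch, as are the names `BadCharacterSketch`, `rightmost` and `shift`. Positions are 1-based, as in the text.

```java
import java.util.HashMap;
import java.util.Map;

final class BadCharacterSketch {
    // b(a): 1-based position of the rightmost occurrence of a in s.
    // Characters that do not occur in s are simply absent from the map.
    static Map<Character, Integer> rightmost(String s) {
        Map<Character, Integer> b = new HashMap<>();
        for (int k = 1; k <= s.length(); k++) {
            b.put(s.charAt(k - 1), k);  // later positions overwrite earlier ones
        }
        return b;
    }

    // Shift for a mismatch at 1-based pattern position j against text character c.
    static int shift(Map<Character, Integer> b, int j, char c) {
        int k = b.getOrDefault(c, 0);  // case 1: absent character, k = 0, shift j
        return Math.max(1, j - k);     // case 3: negative shift is replaced by 1
    }
}
```

For the search string "reminiscence", a mismatch at position j = 10 against the text character "i" gives a shift of 10 - 6 = 4, since the rightmost "i" is at position 6.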
34

List of the rightmost occurrences in the
blue search string
http://wwwmayr.informatik.tu-muenchen.de/lehre/1999SS/proseminar/jakob/
35
Example: bad-character heuristic (BCH)
Finding the rightmost occurrence in the search string
36
"Good-Suffix Heuristic"
Suppose we have found a mismatch at
position j with s_j ≠ t_{pos+j}, 0 ≤ j ≤ d
(so the characters further to the right all
match; pos is the current position in t).
If j = d, we simply shift the search string
one position further. If, however,
j < d, we have d-j matching characters: the
suffix of the search string s of length d-j and the
corresponding part of the text t, from position pos+j+1
up to position pos+d, agree in d-j characters.

[Figure: search string s aligned at pos in the text, with positions 0, j, j+1 and d marked]
37
The good-suffix function
For every position j in the search string we now compute the value
g[j] = d - max{ k | 0 ≤ k < d and (s[j+1..d] is a suffix of s_k, or s_k is a suffix of s[j+1..d]) },
where s_k denotes the prefix s[1..k] of the search string.
g is called the "good-suffix" function
and can be computed in advance for all 0 ≤ j ≤ d.
It gives the smallest number
of characters by which we can shift the search string s to the
right without losing matches
with the text.
Example: s = nennen
Prefixes: s_1 = n, s_2 = ne, s_3 = nen, s_4 = nenn, s_5 = nenne, s_6 = nennen
Suffixes: s[6..6] = n, s[5..6] = en, s[4..6] = nen, s[3..6] = nnen, s[2..6] = ennen
g[0] = 6 - max{1, 3} = 3, g[1] = 3, g[2] = 3, g[3] = 3, g[4] = 3,
g[5] = 6 - 4 = 2

[Figure: search string s with positions 0, k, j, j+1 and d marked]
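
The good-suffix function can be computed directly from its definition. The sketch below is a brute-force version that simply tries every k; it is meant to make the definition checkable, not to be efficient, and the names `GoodSuffixSketch` and `goodSuffix` are assumptions. The empty prefix (k = 0) always qualifies, so the maximum is always defined.

```java
final class GoodSuffixSketch {
    // g[j] = d - max{ k : 0 <= k < d, s[j+1..d] is a suffix of s[1..k]
    //                                  or s[1..k] is a suffix of s[j+1..d] }
    // Positions in the comments are 1-based, as in the text.
    static int[] goodSuffix(String s) {
        int d = s.length();
        int[] g = new int[d + 1];
        for (int j = 0; j <= d; j++) {
            String suffix = s.substring(j);  // s[j+1..d], empty for j = d
            int best = 0;                    // k = 0 (empty prefix) always qualifies
            for (int k = 1; k < d; k++) {
                String prefix = s.substring(0, k);  // s[1..k]
                if (prefix.endsWith(suffix) || suffix.endsWith(prefix)) best = k;
            }
            g[j] = d - best;
        }
        return g;
    }
}
```

For the search string "nennen" this yields g[0..5] = 3, 3, 3, 3, 3, 2, and g[6] = 1, which corresponds to the rule that a mismatch at j = d shifts the search string by one position.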
38
Good suffix, alternative formulation
L' and l' for the example search pattern:
l'(pos) = length of the longest suffix of
pattern[pos..n] that is also a prefix of the pattern.
L'(pos) = right end of the rightmost copy of
pattern[pos..n].
39
Good suffix example. Note: shift by 1
Length d = 11
pos = 0, j = 6, g(6) = 11 - 6 = 5
pos = 7, j = 5, g(5) = 11 - 3 = 8
k < d, g(0) = 11 - 3 = 8
Conclusion: total length 11. The given heuristic
works well.
40
Further examples

Text: "Wir kennen keinen nennenswerten Fall", search string: "nennen"
Here d = 6, j = 4,
and the letter k does not occur in the search string.
We can therefore shift the string
4 places according to the bad-character heuristic.
Good-suffix heuristic: the
good suffix is "en", giving a shift of 3 positions.

Text: "Wir kennen keinen nennenswerten Fall", search string: "nennen"
Now the mismatching letter n occurs
four times in the search string. Its rightmost occurrence
is at k = 6. We must therefore apply the good-suffix
heuristic. We have computed g[5] = 6 - 4 = 2
in advance and can shift the search string
two places to the right.

Text: "Wir kennen keinen nennenswerten Fall", search string: "nennen"
Here j = 1. The bad-character heuristic
only allows us to shift the string one
position to the right. The
good suffix, however, is "ennen", and the prefix "nen"
of the search string is a suffix of the good suffix.
We have therefore already computed g[1] = 6 - 3 = 3
in advance. The good-suffix heuristic thus allows us
to shift the search string three positions to the
right.
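
Putting both heuristics together, a complete search might look like the sketch below: at each mismatch it shifts by the larger of the bad-character shift and the good-suffix shift. The class name `BoyerMooreSketch`, the brute-force table construction and the convention of returning all 0-based start positions are assumptions for illustration, not the slides' own implementation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class BoyerMooreSketch {
    // Returns the 0-based start positions of all occurrences of s in t.
    static List<Integer> search(String t, String s) {
        int d = s.length();
        List<Integer> hits = new ArrayList<>();
        if (d == 0 || d > t.length()) return hits;

        // Bad-character table b(a): rightmost 1-based position of a in s.
        Map<Character, Integer> b = new HashMap<>();
        for (int k = 1; k <= d; k++) b.put(s.charAt(k - 1), k);

        // Good-suffix table g[0..d], brute force from its definition.
        int[] g = new int[d + 1];
        for (int j = 0; j <= d; j++) {
            String suffix = s.substring(j);
            int best = 0;
            for (int k = 1; k < d; k++) {
                String prefix = s.substring(0, k);
                if (prefix.endsWith(suffix) || suffix.endsWith(prefix)) best = k;
            }
            g[j] = d - best;
        }

        int pos = 0;  // the search string covers t[pos+1..pos+d] in 1-based terms
        while (pos + d <= t.length()) {
            int j = d;  // compare from right to left
            while (j >= 1 && s.charAt(j - 1) == t.charAt(pos + j - 1)) j--;
            if (j == 0) {
                hits.add(pos);   // full match
                pos += g[0];
            } else {
                int bad = j - b.getOrDefault(t.charAt(pos + j - 1), 0);
                pos += Math.max(Math.max(bad, g[j]), 1);
            }
        }
        return hits;
    }
}
```

Running it on the text "Wir kennen keinen nennenswerten Fall" with the search string "nennen" finds the single occurrence at the start of "nennenswerten".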
