
Data Structures, Course 2: Searching

Vocabulary

- sequential search
- element
- order
- binary search
- target
- algorithm
- array
- location
- object
- parameter
- index
- sentinel
- probability
- key
- hash
- collision
- cluster
- synonym
- probe
- load factor

Searching

- One of the most common and time-consuming operations in computer science.
- Its goal is to find the location of a target among a list of objects.

Main contents (in chapter 2)

- List searching (two basic search algorithms)
- Sequential search (including three variations)
- Binary search
- Hashed list searching: the key, through an algorithmic function, determines the location of the data
- Collision resolution
- The list search algorithms are discussed using an array structure

2-1 List searches (working with arrays)

- The algorithm used to search a list depends on the structure of the list
- Sequential search (works on any array) is used when:
- The list is not ordered
- The list is small
- The list is not searched often

Locating data in an unordered list

Figure: search concept. The array A[0] to A[11] holds 4 21 36 14 62 91 8 22 7 81 77 10. Target given: 14. Location wanted: 3.

Sequential search algorithms

- The search needs to tell the calling algorithm two things:
- Did it find the data it was looking for?
- If it did, at what index were the target data found?
- It requires four parameters:
- The list we are searching
- An index to the last element in the list
- The target
- The address where the found element's index location is to be stored
- It returns a Boolean

Sequential search algorithm

Locate the target in an unordered list.
Pre: list must contain at least one element; last is the index to the last element in the list; target contains the data to be located; locn is the address of the index in the calling algorithm.
Post: if found, the matching index is stored in locn and found is true; if not found, last is stored in locn and found is false.
Return: found <boolean>

algorithm seqsearch(val list <array>, val last <index>, val target <keytype>, ref locn <index>)
    looker = 0
    loop (looker < last and target not equal list[looker])
        looker = looker + 1
    end loop
    locn = looker
    if (target equal list[looker])
        found = true
    else
        found = false
    end if
    return found
end seqsearch
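The pseudocode above can be sketched in Python as follows. This is a minimal illustration, not the course's own code: the function name `seq_search` and the choice to return a `(found, locn)` tuple instead of using a reference parameter are assumptions.

```python
def seq_search(items, target):
    """Sequential search; returns (found, locn) like the pseudocode's two results."""
    last = len(items) - 1          # index of the last element; list must be non-empty
    looker = 0
    while looker < last and items[looker] != target:
        looker += 1
    return items[looker] == target, looker


data = [4, 21, 36, 14, 62, 91, 8, 22, 7, 81, 77, 10]   # the array from the search concept figure
print(seq_search(data, 14))   # (True, 3)
print(seq_search(data, 5))    # (False, 11)
```

When the target is missing, locn ends up at the last index, matching the postcondition.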

Variations on sequential searches

- Sentinel search
- Probability search
- Ordered list search

Sentinel search

Locate the target in an unordered list.
Pre: list must contain at least one element and needs one spare slot at index last + 1 for the sentinel; last is the index to the last element in the list; target contains the data to be located; locn is the address of the index in the calling algorithm.
Post: if found, the matching index is stored in locn and found is true; if not found, last is stored in locn and found is false.
Return: found <boolean>

algorithm sentinel_search(val list <array>, val last <index>, val target <keytype>, ref locn <index>)
    list[last + 1] = target
    looker = 0
    loop (target not equal list[looker])
        looker = looker + 1
    end loop
    if (looker <= last)
        found = true
        locn = looker
    else
        found = false
        locn = last
    end if
    return found
end sentinel_search
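A minimal Python sketch of the sentinel idea follows. It assumes the sentinel can be appended to a Python list and removed afterwards, which stands in for the spare array slot the pseudocode relies on; the name `sentinel_search` is illustrative.

```python
def sentinel_search(items, target):
    """Sentinel search; the loop needs only one test because the target is always present."""
    last = len(items) - 1
    items.append(target)           # sentinel stored at index last + 1
    looker = 0
    while items[looker] != target:
        looker += 1
    items.pop()                    # remove the sentinel so the caller's list is unchanged
    if looker <= last:
        return True, looker
    return False, last


values = [4, 21, 36, 14]
print(sentinel_search(values, 14))   # (True, 3)
print(sentinel_search(values, 99))   # (False, 3)
```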

Probability search

Locate the target in an unordered list.
Pre: the same as above.
Post: if found, the matching index is stored in locn, found is true, and the element is moved up in priority; if not found, the same as above.
Return: found <boolean>

algorithm probability_search
    looker = 0
    loop (looker < last and target not equal list[looker])
        looker = looker + 1
    end loop
    if (target equal list[looker])
        found = true
        if (looker > 0)
            temp = list[looker - 1]
            list[looker - 1] = list[looker]
            list[looker] = temp
            looker = looker - 1
        end if
    else
        found = false
    end if
    locn = looker
    return found
end probability_search
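The swap step can be illustrated with a short Python sketch. The function name and the tuple return value are assumptions made for the example.

```python
def probability_search(items, target):
    """Sequential search that moves a found element one position toward the front."""
    last = len(items) - 1
    looker = 0
    while looker < last and items[looker] != target:
        looker += 1
    found = items[looker] == target
    if found and looker > 0:
        # exchange the found element with its predecessor
        items[looker - 1], items[looker] = items[looker], items[looker - 1]
        looker -= 1
    return found, looker


values = [4, 21, 36, 14]
print(probability_search(values, 36))   # (True, 1)
print(values)                           # [4, 36, 21, 14]
```

Each successful search promotes the element by one place, so frequently requested keys drift toward the start of the list.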

Ordered list search

- Locate the target in a list ordered on the target
- Notes:
- It is not necessary to search to the end of the list
- It is only for small lists
- It incorporates the sentinel idea
- Pre: the same as sequential search
- Post:
- If found: the same as above
- If not found: locn is the index of the first element > target, or locn equals last; found is false
- Return: found <boolean>

algorithm ordered_list_search
    if (target < list[last])
        looker = 0
        loop (target > list[looker])
            looker = looker + 1
        end loop
    else
        looker = last
    end if
    if (target equal list[looker])
        found = true
    else
        found = false
    end if
    locn = looker
    return found
end ordered_list_search
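A minimal Python sketch of the ordered list search follows, assuming an ascending list; the name `ordered_search` is illustrative.

```python
def ordered_search(items, target):
    """Sequential search of an ascending list that stops at the first element >= target."""
    last = len(items) - 1
    if target < items[last]:
        looker = 0
        while target > items[looker]:   # the last element bounds the loop like a sentinel
            looker += 1
    else:
        looker = last
    return items[looker] == target, looker


ordered = [4, 7, 8, 10, 14, 21]
print(ordered_search(ordered, 10))   # (True, 3)
print(ordered_search(ordered, 9))    # (False, 3)
print(ordered_search(ordered, 99))   # (False, 5)
```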

Binary search

- The sequential search algorithm is very slow for large lists
- But it is the only solution if the array is not sorted
- Binary search works on an ordered list
- It is suited to large lists
- First sort
- Then search

Binary search method

- Suppose L is a sorted list and we are searching for a value X
- 1. Compare X to the middle value (M) in L
- 2. If X = M, we are done
- 3. If X < M, we continue our search, but we can confine it to the first half of L and entirely ignore the second half of L
- 4. If X > M, we continue, but confine ourselves to the second half of L

Figure: binary search trace using the indexes first, mid and last. Target not found: target 11 is not in the list.

Binary search (ordered list)

Pre: list is ordered and must contain at least one element; end is the index to the largest element in the list; target is the value of the element being sought; locn is the address of the index in the calling algorithm.
Post: if found, locn is assigned the index of the target element and found is set true; if not found, locn is the index of the element below or above the target and found is set false.
Return: found <boolean>

algorithm binary_search(val list <array>, val end <index>, val target <keytype>, ref locn <index>)
    first = 0
    last = end
    loop (first <= last)
        mid = (first + last) / 2
        if (target > list[mid])
            (look in upper half)
            first = mid + 1
        else if (target < list[mid])
            (look in lower half)
            last = mid - 1
        else
            (found equal: force exit)
            first = last + 1
        end if
    end loop
    locn = mid
    if (target equal list[mid])
        found = true
    else
        found = false
    end if
    return found
end binary_search
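The binary search can be sketched in Python as below. The sorted sample data (the earlier unordered array, sorted) and the use of `break` in place of the pseudocode's forced exit are choices made for this illustration.

```python
def binary_search(items, target):
    """Binary search of an ascending, non-empty list; returns (found, locn)."""
    first, last = 0, len(items) - 1
    mid = 0
    while first <= last:
        mid = (first + last) // 2
        if target > items[mid]:
            first = mid + 1        # look in upper half
        elif target < items[mid]:
            last = mid - 1         # look in lower half
        else:
            break                  # found equal: leave the loop
    return items[mid] == target, mid


ordered = [4, 7, 8, 10, 14, 21, 22, 36, 62, 77, 81, 91]
print(binary_search(ordered, 22))   # (True, 6)
print(binary_search(ordered, 11))   # (False, 4)
```

For the missing target 11, locn points at 14, the element just above the target.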

Analyzing (the efficiency)

- Sequential search, sentinel search, ordered list search: O(n)
- Binary search: O(log2 n)
- Comparison of binary and sequential searches:

Size        Binary   Sequential (average)   Sequential (worst case)
16          4        8                      16
10,000      14       5,000                  10,000
1,000,000   20       500,000                1,000,000

2-3 Hashed list searches

- Ideal search: we would know exactly where the data are and go directly there
- Goal of a hashed search: to find the data with only one test
- Uses an array of data

Figure 2-6 Hash concept. Keys (102002, 107095, 111060 in the figure) are passed through the hash function, which produces addresses in the array (5, 100 and 2 in the figure).

Basic concepts

Hash search: a search in which the key, through an algorithmic function, determines the location of the data. We use a hashing algorithm to transform the key into the index that contains the data we need to locate (key-to-address transformation).

Problem

- Synonyms: a set of keys that hash to the same location
- Collision: occurs when a list contains two or more synonyms, so that a key hashes to an address that is already occupied
- Home address: the address produced by the hashing algorithm
- Prime area: the memory that contains all of the home addresses
- Collision resolution: when two keys collide at a home address, place one of the keys and its data in another location

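Figure 2-7 The collision resolution concept. Three inserts are shown: 1. hash(A); 2. hash(B), where B and A collide at 8 and collision resolution is applied; 3. hash(C), where C and B collide at 16 and collision resolution is applied again. The addresses shown are 0, 4, 8 and 16.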

Locating an element in a hashed list

- Use the same algorithm that was used to insert it into the list
- First hash the key and check the home address
- If the home address contains the desired element, the search is complete
- If not, use the collision resolution algorithm to determine the next location, and continue until the element is found or it is determined that it is not in the list
- Probe: each calculation of an address and test for success

Hashing methods

- Direct
- Subtraction
- Modulo division
- Digit extraction
- Midsquare
- Folding
- Rotation
- Pseudorandom generation

Figure 2-8 Basic hashing techniques

Direct method

- The key is the address (one element per key, no synonyms)
- Example 1: total monthly sales by the days of the month
- Create an array of 31 accumulators
- The accumulation code is

dailySales[sale.day] = dailySales[sale.day] + sale.amount
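The accumulation step might look like this in Python. The sample sales are made up for the illustration, and the array is given 32 slots so that index 0 is simply unused and days 1 to 31 map directly to their own index; both are assumptions.

```python
# Direct hashing: the day of the month is used directly as the array index.
daily_sales = [0.0] * 32           # slot 0 unused; slots 1..31 are the accumulators

sales = [(1, 120.0), (15, 40.5), (1, 30.0)]   # hypothetical (day, amount) records
for day, amount in sales:
    daily_sales[day] = daily_sales[day] + amount

print(daily_sales[1])    # 150.0
print(daily_sales[15])   # 40.5
```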

Example 2: a small company has fewer than 100 employees, and each employee number is between 1 and 100.

Address   Record
000       (not used)
001       Harry Lee
002       Sarah Trapp
005       Vu Nguyen
100       John Adams

Figure 2-9 Direct hashing of employee numbers. The keys 005, 100 and 002 hash directly to the addresses 5, 100 and 2.

Subtraction method

- Keys are consecutive but do not start from 1
- Example: student ID numbers
- Advantages:
- The hashing function is very simple
- No collisions
- Disadvantage:
- Only for small lists

Notes:
1. Generally speaking, hashed lists require some empty elements to reduce the number of collisions.
2. The two methods above are the ideal, but their application is very limited (for example, ID card numbers).

Modulo-division method (division remainder)

This method divides the key by the array size and uses the remainder for the address. The hashing algorithm is

address = key modulo listSize

Note: a prime number listSize produces fewer collisions.

Figure 2-10 Modulo-division hashing (listSize = 307, addresses 000 to 306). The keys 121267, 045128 and 379452 hash to the addresses 2, 306 and 0.

Records in the list:
379452 Marry Dodd
121267 Bryan Devaux
378845 John Carver
160252 Tuan Ngo
045128 Shouli Feldman
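The three addresses in Figure 2-10 can be checked with a one-line hash function. The name `modulo_hash` is illustrative.

```python
LIST_SIZE = 307   # prime list size used in the figure

def modulo_hash(key, list_size=LIST_SIZE):
    """Modulo-division hashing: the remainder of key / list_size is the address."""
    return key % list_size


for key in (121267, 45128, 379452):
    print(key, "->", modulo_hash(key))
# 121267 -> 2
# 45128 -> 306
# 379452 -> 0
```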

Digit extraction method

Selected digits are extracted from the key and used as the address.

Example: select the first, third and fourth digits of a 6-digit employee number to form a 3-digit address.

Key (6 digits)   Address (3 digits)
379452           394
121267           112
378845           388
160252           102
045128           051
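A small Python sketch reproduces the table. The function name and the 1-based `positions` parameter are assumptions for the example; keys are padded to six digits so that 045128 keeps its leading zero.

```python
def digit_extract(key, positions=(1, 3, 4), width=6):
    """Build the address from the selected (1-based) digit positions of the key."""
    digits = str(key).zfill(width)
    return int("".join(digits[p - 1] for p in positions))


for key in (379452, 121267, 378845, 160252, 45128):
    print(str(key).zfill(6), "->", str(digit_extract(key)).zfill(3))
# 379452 -> 394
# 121267 -> 112
# 378845 -> 388
# 160252 -> 102
# 045128 -> 051
```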

Midsquare method

The key is squared and the address is selected from the middle of the squared number. Limitation: the size of the key.

Example with 4-digit keys: 9452 × 9452 = 89340304, so the address is 3403.

Variation: select a portion of the key (here digits 1 to 3), square it, fill the square with leading zeros to 6 digits, and select digits 3 to 5 as the address.

Key      Digits 1-3   Squared   Address
379452   379          143641    364
121267   121          014641    464
378845   378          142884    288
160252   160          025600    560
045128   045          002025    202
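Both midsquare examples can be reproduced with the sketch below. The function names are illustrative, and the fixed widths (8-digit square for 4-digit keys, 6-digit square for the 3-digit portion) follow the examples rather than any general rule.

```python
def midsquare(key):
    """4-digit key: square it and take the middle four digits of the 8-digit square."""
    squared = str(key * key).zfill(8)
    return int(squared[2:6])

def midsquare_partial(key):
    """6-digit key: square digits 1-3, pad the square to 6 digits, take digits 3-5."""
    first_three = int(str(key).zfill(6)[:3])
    squared = str(first_three ** 2).zfill(6)
    return int(squared[2:5])


print(midsquare(9452))              # 3403
print(midsquare_partial(379452))    # 364
print(midsquare_partial(45128))     # 202
```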

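Folding methods: fold shift and fold boundary

Key: 123456789, divided into the parts 123, 456 and 789.

(a) Fold shift: 123 + 456 + 789 = 1368; the leading 1 is discarded, so the address is 368.
(b) Fold boundary: the digits of the left and right parts are reversed, giving 321 + 456 + 987 = 1764; the leading 1 is discarded, so the address is 764.

Figure 2-11 Hash fold examples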
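The two folds in Figure 2-11 can be sketched in Python as follows. The function names and the `part` parameter (the address width in digits) are assumptions for the example; discarding the carry is done with a modulo.

```python
def split_parts(key, part):
    """Split the key's decimal digits into consecutive groups of `part` digits."""
    digits = str(key)
    return [digits[i:i + part] for i in range(0, len(digits), part)]

def fold_shift(key, part=3):
    """Add the parts as they are and discard any carry beyond `part` digits."""
    return sum(int(p) for p in split_parts(key, part)) % 10 ** part

def fold_boundary(key, part=3):
    """Reverse the leftmost and rightmost parts, then add and discard the carry."""
    parts = split_parts(key, part)
    parts[0], parts[-1] = parts[0][::-1], parts[-1][::-1]
    return sum(int(p) for p in parts) % 10 ** part


print(fold_shift(123456789))      # 368
print(fold_boundary(123456789))   # 764
```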

Rotation method

Rotation is usually incorporated with other hashing methods. It is useful when keys are assigned serially: the rightmost digit is rotated to the front of the key.

Original key   Rotated key
600101         160010
600102         260010
600103         360010
600104         460010
600105         560010

Figure 2-12 Rotation hashing
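The rotation in Figure 2-12 amounts to moving the last digit to the front, as this short sketch shows. The function name and the fixed width of six digits are assumptions.

```python
def rotate_key(key, width=6):
    """Move the rightmost digit of the key to the leftmost position."""
    digits = str(key).zfill(width)
    return int(digits[-1] + digits[:-1])


for key in range(600101, 600106):
    print(key, "->", rotate_key(key))
# 600101 -> 160010
# 600102 -> 260010
# 600103 -> 360010
# 600104 -> 460010
# 600105 -> 560010
```

The serial keys differ only in their last digit, so after rotation they are spread far apart, which is why rotation is combined with methods such as folding or modulo division.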

Pseudorandom method

In this method, the key is used as the seed in a pseudorandom number generator, and the resulting random number is scaled into the possible address range using modulo division.

A common random generator is y = ax + c. For efficiency, factors a and c should be prime numbers, for example a = 17 and c = 7. With a list size of 307:

(17 × 121267 + 7) modulo 307 = 41
(17 × 045128 + 7) modulo 307 = 297
(17 × 379452 + 7) modulo 307 = 7

Figure: pseudorandom hashing. The keys 121267, 045128 and 379452 hash to the addresses 041, 297 and 007 in a list with addresses 000 to 306.

Records in the list:
379452 Marry Dodd
121267 Bryan Devaux
378845 John Carver
045128 Shouli Feldman
160252 Tuan Ngo
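The three calculations above can be verified with a direct transcription of y = ax + c followed by modulo division. The function name is illustrative; a, c and the list size are the values from the slide.

```python
def pseudorandom_hash(key, a=17, c=7, list_size=307):
    """Use the key as the seed x in y = a*x + c, then scale with modulo division."""
    return (a * key + c) % list_size


for key in (121267, 45128, 379452):
    print(key, "->", pseudorandom_hash(key))
# 121267 -> 41
# 45128 -> 297
# 379452 -> 7
```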

Hash algorithm

- Convert the alphanumeric key into a number by adding the American Standard Code for Information Interchange (ASCII) value of each character to an accumulator.
- Rotate the bits in the address to maximize the distribution of the values.
- Take the absolute value of the address and map it into the address range.

This algorithm converts an alphanumeric key of size characters into an integral address.
Pre: key is a key to be hashed; size is the number of characters in the key; maxAddr is the maximum possible address for the list.
Post: addr contains the hashed address.

algorithm hash(val key <array>, val size <integer>, val maxAddr <integer>, ref addr <integer>)
    looper = 0
    addr = 0
    (hash key)
    loop (looper < size)
        if (key[looper] not space)
            addr = addr + key[looper]
            rotate addr 12 bits right
        end if
        looper = looper + 1
    end loop
    (test for negative address)
    if (addr < 0)
        addr = absolute(addr)
    end if
    addr = addr modulo maxAddr
    return
end hash
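A possible Python rendering of the hash algorithm is sketched below. Python integers are unbounded, so the sketch assumes a 32-bit signed address word to give "rotate 12 bits right" and the negative-address test a meaning; the helper `rotate_right_32` and the word size are assumptions, not part of the slides.

```python
MASK = 0xFFFFFFFF   # assumed 32-bit address word

def rotate_right_32(value, bits=12):
    """Rotate a 32-bit word right by `bits` positions."""
    value &= MASK
    return ((value >> bits) | (value << (32 - bits))) & MASK

def hash_key(key, max_addr):
    """Add each non-space character code to the address, rotating after every add."""
    addr = 0
    for ch in key:
        if ch != " ":
            addr = rotate_right_32(addr + ord(ch))
    if addr & 0x80000000:          # reinterpret the word as a signed 32-bit value
        addr -= 1 << 32
    return abs(addr) % max_addr


print(hash_key("Harry Lee", 307))   # some address in the range 0..306
```

Spaces are skipped, so "Harry Lee" and "HarryLee" hash to the same address.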

2-4 Collision resolution

- Except for the direct and subtraction methods, none of the hashing methods is a one-to-one mapping
- Collisions cannot be avoided
- There are several methods for handling collisions

Figure 2-13 Collision resolution methods

- Open addressing: linear probe, quadratic probe, pseudorandom, key offset
- Linked lists
- Buckets

Several concepts

- Load factor: the number of elements in the list divided by the number of physical elements allocated for the list, usually expressed as a percentage
- Clustering: the tendency of data to group within the list (to build up unevenly across a hashed list)
- A high degree of clustering increases the number of probes needed to locate an element and reduces the processing efficiency of the list. There are two kinds:
- Primary clustering: data cluster around a home address
- Secondary clustering: data become grouped along a collision path throughout a list
- Hashing algorithms need to be designed to minimize clustering

Open addressing

- Resolves collisions in the prime area (the area that contains all of the home addresses)
- Linear probe
- Quadratic probe
- Double hashing:
- Pseudorandom
- Key offset

Linear probe

Figure 2-14 Linear probe collision resolution (addresses 000 to 306). The keys 070918 and 166702 both hash to address 1. The first insert causes no collision. The second insert collides, so 1 is added to the address.

Records in the list:
379452 Marry Dodd
070918 Sarah Trapp
121267 Bryan Devaux
166702 Harry Eagle
378845 John Carver
160252 Tuan Ngo
045128 Shouli Feldman

Linear probe (continued)

Variation: add 1, subtract 2, add 3, subtract 4, and so on.

Advantage: simple to implement.

Disadvantages: first, it tends to produce primary clustering; second, it tends to make the search algorithm more complex.
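A minimal sketch of insertion with the basic add-1 linear probe follows. It assumes modulo-division hashing into a Python list where `None` marks an empty slot, and it returns the number of probes used; these details are illustrative.

```python
def linear_probe_insert(table, key):
    """Insert key at its home address, or at the next free address after it."""
    size = len(table)
    home = key % size
    for step in range(size):
        addr = (home + step) % size        # add 1 per collision, wrapping at the end
        if table[addr] is None:
            table[addr] = key
            return addr, step + 1          # address used and number of probes
    raise OverflowError("table is full")


table = [None] * 307
print(linear_probe_insert(table, 70918))    # (1, 1)  first insert, no collision
print(linear_probe_insert(table, 166702))   # (2, 2)  collides at 1, stored at 2
```

Only these two keys are inserted here; in a fuller list the second key would keep moving forward until it met a free address.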

Quadratic probe

- Used to eliminate primary clustering
- The increment is the collision probe number squared: for the first probe add 1², for the second probe add 2², and so on
- The new address is taken modulo the list size
- Disadvantages:
- 1. The time required to square the probe number
- 2. It is not possible to generate a new address for every element in the list
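The probe sequence can be illustrated as follows. The sketch assumes each increment is added to the most recent collision address (so the increments accumulate); the slides do not say whether the increment is applied to the home address or to the previous address, so treat that as an assumption.

```python
def quadratic_probe_sequence(home, list_size, probes):
    """Addresses tried after a collision at `home`, adding 1^2, 2^2, 3^2, ... in turn."""
    addr = home
    path = []
    for n in range(1, probes + 1):
        addr = (addr + n * n) % list_size   # increment is the probe number squared
        path.append(addr)
    return path


print(quadratic_probe_sequence(1, 307, 4))   # [2, 6, 15, 31]
```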

Pseudorandom collision resolution

- A form of double hashing: the address is rehashed
- Uses a pseudorandom number to resolve the collision
- Uses the collision address as a factor in the random number calculation, such as

new address = (3 × collision address + 5) modulo listSize

Figure 2-15 shows this collision resolution applied to the collision in Figure 2-14.

Pseudorandom probe

Figure 2-15 Pseudorandom collision resolution (addresses 000 to 306). The keys 070918 and 166702 both hash to address 1. The first insert causes no collision. The second insert collides and is resolved with the pseudorandom formula y = 3x + 5, which gives 3 × 1 + 5 = 8.

Records in the list:
379452 Marry Dodd
070918 Sarah Trapp
121267 Bryan Devaux
378845 John Carver
166702 Harry Eagle
160252 Tuan Ngo
045128 Shouli Feldman
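The rehash in Figure 2-15 is a single expression. The function name is illustrative; the factors 3 and 5 and the list size 307 come from the slides.

```python
def pseudorandom_probe(collision_addr, list_size=307, a=3, c=5):
    """Rehash the collision address with y = a*x + c, scaled by modulo division."""
    return (a * collision_addr + c) % list_size


print(pseudorandom_probe(1))   # 8: where key 166702 goes after colliding at address 1
```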

Key offset

- Another double hashing method
- Produces different collision paths for different keys
- In its simplest version, key offset calculates the new address as

offset = ⌊key / listSize⌋
address = (offset + old address) modulo listSize

Example: the key is 166702 and the list size is 307. Modulo division generates an address of 1. This synonym of 070918 produces a collision at 1. Using key offset to calculate the next address:

offset = ⌊166702 / 307⌋ = 543
address = (543 + 001) modulo 307 = 237

If 237 were also a collision, repeat the process:

offset = ⌊166702 / 307⌋ = 543
address = (543 + 237) modulo 307 = 166

To really see the effect of key offset, we need to calculate several different keys that all hash to the same home address. Table 2-3 shows three keys that collide at address 001 and their next two collision probe addresses.

Key      Home address   Key offset   Probe 1   Probe 2
166702   1              543          237       166
572556   1              1865         024       047
067234   1              219          220       132

Table 2-3 Key offset

Note: each key resolves its collision at a different address for both the first and second probes.
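Table 2-3 can be reproduced with a short function. The name `key_offset_probe` is illustrative; the formula is the one given above.

```python
def key_offset_probe(key, old_addr, list_size=307):
    """Next address = (floor(key / list_size) + old address) modulo list_size."""
    offset = key // list_size
    return (offset + old_addr) % list_size


for key in (166702, 572556, 67234):
    home = key % 307
    probe1 = key_offset_probe(key, home)
    probe2 = key_offset_probe(key, probe1)
    print(key, home, key // 307, probe1, probe2)
# 166702 1 543 237 166
# 572556 1 1865 24 47
# 67234 1 219 220 132
```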

Linked list resolution

- Eliminates a disadvantage of open addressing: each collision resolution increases the probability of future collisions
- A linked list is an ordered collection of data in which each element contains the location of the next element

Figure 2-16 Linked list collision resolution (addresses 000 to 306). The synonyms 166702 Harry Eagle and 572556 Chris Wallj are chained by pointers from their home address instead of occupying other addresses in the prime area.

Records in the prime area:
379452 Marry Dodd
070918 Sarah Trapp
121267 Bryan Devaux
160252 Tuan Ngo
045128 Shouli Feldman

Linked list resolution (continued)

- Linked list resolution uses a separate area to store collisions and chains all synonyms together in a linked list
- It uses two storage areas, the prime area and the overflow area
- Each element in the prime area contains an additional field, a link head pointer
- The linked list data can be stored in any order, but the most common is key sequence
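The prime area and overflow area can be sketched as below. For brevity, each chain is a sorted Python list standing in for a real linked list with a link head pointer; the class name and that simplification are assumptions of this illustration.

```python
import bisect

class ChainedTable:
    """Prime area holds one key per home address; synonyms go to an overflow chain."""

    def __init__(self, size=307):
        self.prime = [None] * size
        self.overflow = [[] for _ in range(size)]   # one chain per home address

    def insert(self, key):
        home = key % len(self.prime)
        if self.prime[home] is None:
            self.prime[home] = key
        else:
            bisect.insort(self.overflow[home], key)  # keep the chain in key sequence

    def search(self, key):
        home = key % len(self.prime)
        return self.prime[home] == key or key in self.overflow[home]


table = ChainedTable()
for key in (70918, 572556, 166702):    # all three hash to address 1
    table.insert(key)
print(table.prime[1], table.overflow[1])   # 70918 [166702, 572556]
```

Synonyms never occupy another key's home address, so a collision does not raise the chance of later collisions elsewhere in the prime area.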

Bucket hashing

Buckets are nodes that accommodate multiple data occurrences. Collisions are postponed until the bucket is full.

Bucket 0: 379452 Marry Dodd, (empty), (empty)
Bucket 1: 070918 Sarah Trapp, 166702 Harry Eagle, 367173 Ann Georgis
Bucket 2: 121267 Bryan Devaux, 572556 Chris Wallj (linear probe places it here), (empty)
...
Bucket 307: 045128 Shouli Feldman, (empty), (empty)

Figure 2-17 Bucket hashing

Two problems; combination approaches

- First, bucket hashing uses significantly more space, because many of the buckets will be empty or partially empty
- Second, it does not completely resolve the collision problem
- One way to resolve a collision on a full bucket is to use the linear probe
- There are several approaches to resolving collisions, and a solution often uses multiple steps
- Example: a large database hashes to a bucket; if the bucket is full, it uses a linear probe; after that, it uses a linked list overflow area
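The combination of buckets with a linear probe can be sketched as follows. The bucket capacity of 3 matches Figure 2-17; the function name and the use of Python lists as buckets are assumptions.

```python
def bucket_insert(buckets, key, capacity=3):
    """Put key in its home bucket, or linear-probe to the next bucket with room."""
    size = len(buckets)
    home = key % size
    for step in range(size):
        b = (home + step) % size
        if len(buckets[b]) < capacity:
            buckets[b].append(key)
            return b
    raise OverflowError("all buckets are full")


buckets = [[] for _ in range(307)]
for key in (70918, 166702, 367173, 121267, 572556):
    print(key, "-> bucket", bucket_insert(buckets, key))
# 70918 -> bucket 1
# 166702 -> bucket 1
# 367173 -> bucket 1
# 121267 -> bucket 2
# 572556 -> bucket 2   (home bucket 1 is full, so the linear probe moves on)
```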

Summary

- Searching is the process of finding the location of a target among a list of objects
- There are two basic searching methods for arrays: sequential and binary search
- The sequential search is normally used when a list is not sorted. It starts at the beginning of the list and searches until it finds the data or hits the end of the list
- One variation of the sequential search is the sentinel search. In this method, the condition ending the search is reduced to only one by artificially inserting the target at the end of the list
- The second variation of the sequential search is called the probability search. In this method, the list is ordered with the most probable elements at the beginning of the list and the least probable at the end

- The sequential search can also be used to search a sorted list. In this case, we can terminate the search when the target is less than the current element
- If an array is sorted, we can use a more efficient algorithm called the binary search
- The binary search algorithm searches the list by first checking the middle element. If the target is not in the middle element, the algorithm eliminates the upper half or the lower half of the list, depending on the value of the middle element. The process continues until the target is found or the reduced list length becomes zero
- The efficiency of a sequential search is O(n)
- The efficiency of a binary search is O(log2 n)

- In a hashed search, the key, through an algorithmic transformation, determines the location of the data. It is a key-to-address transformation
- We discussed several hashing functions: direct, subtraction, modulo division, digit extraction, midsquare, folding, rotation, and pseudorandom generation

- In direct hashing, the key is the address without any algorithmic manipulation
- In subtraction hashing, the key is transformed to an address by subtracting a fixed number from it
- In modulo-division hashing, the key is divided by the list size (recommended to be a prime number) and the remainder is used as the address
- In digit-extraction hashing, selected digits are extracted from the key and used as an address
- In midsquare hashing, the key is squared and the address is selected from the middle of the result
- In fold shift hashing, the key is divided into parts whose sizes match the size of the required address. Then the parts are added to obtain the address

- In fold boundary hashing, the key is divided into parts whose sizes match the size of the required address. Then the left and right parts are reversed and added to the middle part to obtain the address
- In rotation hashing, the rightmost digit of the key is rotated to the left to determine an address. However, this method is usually used in combination with other methods
- In pseudorandom generation hashing, the key is used as the seed to generate a pseudorandom number. The result is then scaled to obtain the address
- Except in the direct and subtraction methods, collisions are unavoidable in hashing. A collision occurs when a new key is hashed to an address that is already occupied

- Clustering is the tendency of data to build up unevenly across a hashed list
- Primary clustering occurs when data build up around a home address
- Secondary clustering occurs when data build up along a collision path in the list
- To solve a collision, a collision resolution method is used
- Three general methods are used to resolve collisions: open addressing, linked lists, and buckets
- The open addressing method can be subdivided into linear probe, quadratic probe, pseudorandom rehashing, and key-offset rehashing

- In the linear probe method, when a collision occurs, the new data are stored in the next available address
- In the quadratic probe method, the increment is the collision probe number squared
- In the pseudorandom rehashing method, we use a random number generator to rehash the address
- In the key-offset rehashing method, we use an offset to rehash the address

- In the linked list technique, we use separate areas to store collisions and chain all synonyms together in a linked list
- In bucket hashing, we use a bucket that can accommodate multiple data occurrences

Homework

- Using the modulo-division method and linear probing, store the keys shown below in an array with 19 elements. How many collisions occurred? What is the load factor of the list after all keys have been inserted?
- 224562, 137456, 214562, 140145, 214567, 162145, 144467, 199645, 234534
- Repeat the problem above using the digit-extraction method (first, third and fifth digits) and quadratic probing.