Carnegie Mellon Univ' Dept' of Computer Science 15415 Database Applications - PowerPoint PPT Presentation

About This Presentation
Title:

Carnegie Mellon Univ' Dept' of Computer Science 15415 Database Applications

Description:

Carnegie Mellon Univ' Dept' of Computer Science 15415 Database Applications – PowerPoint PPT presentation

Number of Views:163
Avg rating:3.0/5.0
Slides: 58
Provided by: christosf
Learn more at: http://www.cs.cmu.edu
Category:

less

Transcript and Presenter's Notes

Title: Carnegie Mellon Univ' Dept' of Computer Science 15415 Database Applications


1
Carnegie Mellon Univ.Dept. of Computer
Science15-415 - Database Applications
  • C. Faloutsos
  • Indexing and Hashing part II

2
General Overview - rel. model
  • Relational model - SQL
  • Formal commercial query languages
  • Functional Dependencies
  • Normalization
  • Physical Design
  • Indexing

3
Indexing- overview
  • ISAM and B-trees
  • hashing
  • Hashing vs B-trees
  • Indices in SQL
  • Advanced topics
  • dynamic hashing
  • multi-attribute indexing

4
(Static) Hashing
  • Problem find EMP record with ssn123
  • What if disk space was free, and time was at
    premium?

5
Hashing
  • A Brilliant idea key-to-address transformation

0 page
123 Smith Main str
123 page
999,999,999
6
Hashing
  • Since space is NOT free
  • use M, instead of 999,999,999 slots
  • hash function h(key) slot-id

7
Hashing
  • Typically each hash bucket is a page, holding
    many records

8
Hashing
  • Notice could have clustering, or non-clustering
    versions

9
Hashing
  • Notice could have clustering, or non-clustering
    versions

10
Indexing- overview
  • ISAM and B-trees
  • hashing
  • hashing functions
  • size of hash table
  • collision resolution
  • Hashing vs B-trees
  • Indices in SQL
  • Advanced topics

11
Design decisions
  • 1) formula h() for hashing function
  • 2) size of hash table M
  • 3) collision resolution method

12
Design decisions - functions
  • Goal uniform spread of keys over hash buckets
  • Popular choices
  • Division hashing
  • Multiplication hashing

13
Division hashing
  • h(x) (axb) mod M
  • eg., h(ssn) (ssn) mod 1,000
  • gives the last three digits of ssn
  • M size of hash table - choose a prime number,
    defensively (why?)

14
Division hashing
  • eg., M2 hash on driver-license number (dln),
    where last digit is gender (0/1 M/F)
  • in an army unit with predominantly male soldiers
  • Thus avoid cases where M and keys have common
    divisors - prime M guards against that!

15
Multiplication hashing
  • h(x) fractional-part-of ( x f ) M
  • f golden ratio ( 0.618... ( sqrt(5)-1)/2 )
  • in general, we need an irrational number
  • advantage M need not be a prime number
  • but f must be irrational

16
Other hashing functions
  • quadratic hashing (bad)
  • ...
  • conclusion use division hashing

17
Design decisions
  • 1) formula h() for hashing function
  • 2) size of hash table M
  • 3) collision resolution method

18
Size of hash table
  • eg., 50,000 employees, 10 employee-records /
    page
  • Q M?? pages/buckets/slots

19
Size of hash table
  • eg., 50,000 employees, 10 employees/page
  • Q M?? pages/buckets/slots
  • A utilization 90 and
  • M prime number
  • Eg., in our case M closest prime to 50,000/10 /
    0.9 5,555

20
Design decisions
  • 1) formula h() for hashing function
  • 2) size of hash table M
  • 3) collision resolution method

21
Collision resolution
  • Q what is a collision?
  • A ??

22
Collision resolution
0 page
FULL
h(123)
123 Smith Main str.
M
23
Collision resolution
  • Q what is a collision?
  • A ??
  • Q why worry about collisions/overflows? (recall
    that buckets are 90 full)
  • A birthday paradox

24
Collision resolution
  • open addressing
  • linear probing (ie., put to next slot/bucket)
  • re-hashing
  • separate chaining (ie., put links to overflow
    pages)

25
Collision resolution
linear probing
0 page
FULL
h(123)
123 Smith Main str.
M
26
Collision resolution
re-hashing
0 page
h1()
FULL
h(123)
123 Smith Main str.
h2()
M
27
Collision resolution
separate chaining
FULL
123 Smith Main str.
28
Design decisions - conclusions
  • function division hashing
  • h(x) ( axb ) mod M
  • size M 90 util. prime number.
  • collision resolution separate chaining
  • easier to implement (deletions!)
  • no danger of becoming full

29
Indexing- overview
  • ISAM and B-trees
  • hashing
  • Hashing vs B-trees
  • Indices in SQL
  • Advanced topics
  • dynamic hashing
  • multi-attribute indexing

30
Hashing vs B-trees
  • Hashing offers
  • speed ! ( O(1) avg. search time)
  • ..but

31
Hashing vs B-trees
  • ..but B-trees give
  • key ordering
  • range queries
  • proximity queries
  • sequential scan
  • O(log(N)) guarantees for search, ins./del.
  • graceful growing/shrinking

32
Hashing vs B-trees
  • thus
  • B-trees are implemented in most systems
  • footnotes
  • hashing is not (why not?)
  • dbm and ndbm of UNIX offer one or both

33
Indexing- overview
  • ISAM and B-trees
  • hashing
  • Hashing vs B-trees
  • Indices in SQL
  • Advanced topics
  • dynamic hashing
  • multi-attribute indexing

34
Indexing in SQL
  • create index ltindex-namegt on ltrelation-namegt
    (ltattribute-listgt)
  • create unique index ltindex-namegt on
    ltrelation-namegt (ltattribute-listgt)
  • drop index ltindex-namegt

35
Indexing in SQL
  • eg.,
  • create index ssn-index
  • on STUDENT (ssn)
  • or (eg., on TAKES(ssn,cid, grade) )
  • create index sc-index
  • on TAKES (ssn, c-id)

36
Indexing- overview
  • ISAM and B-trees
  • hashing
  • Hashing vs B-trees
  • Indices in SQL
  • Advanced topics (theoretical interest)
  • dynamic hashing
  • multi-attribute indexing

37
Problem with static hashing
  • problem overflow?
  • problem underflow? (underutilization)

38
Solution Dynamic/extendible hashing
  • idea shrink / expand hash table on demand..
  • ..dynamic hashing
  • Details how to grow gracefully, on overflow?
  • Many solutions - One of them extendible hashing

39
Extendible hashing
0 page
FULL
h(123)
123 Smith Main str.
M
40
Extendible hashing
0 page
solution split the bucket in two
FULL
h(123)
123 Smith Main str.
M
41
Extendible hashing
  • in detail
  • keep a directory, with ptrs to hash-buckets
  • Q how to divide contents of bucket in two?
  • A hash each key into a very long bit string
    keep only as many bits as needed
  • Eventually

42
Extendible hashing
directory
0001...
0111...
00...
01...
10101...
10...
10011...
10110...
11...
1101...
101001...
43
Extendible hashing
directory
0001...
0111...
00...
01...
10101...
10...
10011...
10110...
11...
1101...
101001...
44
Extendible hashing
directory
0001...
0111...
00...
01...
10101...
10...
split on 3-rd bit
10011...
10110...
11...
101001...
1101...
45
Extendible hashing
directory
0001...
0111...
00...
01...
new page / bucket
10...
10011...
10101...
11...
101001...
10110...
1101...
46
Extendible hashing
directory (doubled)
new page / bucket
47
Extendible hashing
BEFORE
AFTER
0001...
0111...
10011...
1101...
48
Extendible hashing
  • Summary directory doubles on demand
  • or halves, on shrinking files
  • needs local and global depth (see book)
  • Mainly, of theoretical interest - same for
  • linear hashing of Litwin
  • order preserving
  • perfect hashing (no collisions!)

49
Indexing- overview
  • ISAM and B-trees
  • hashing
  • Hashing vs B-trees
  • Indices in SQL
  • Advanced topics
  • dynamic hashing
  • multi-attribute indexing

50
multiple-key access
  • how to support queries on multiple attributes,
    like
  • gradegt3 and course415
  • major motivation Geographic Information systems
    (GIS)

51
multiple-key access
y
x
52
multiple-key access
  • Typical query
  • Find cities within x miles from Pittsburgh
  • thus, we want to store nearby cities on the same
    disk page

53
multiple-key access
y
x
54
multiple-key access
y
x
55
multiple-key access - R-trees
y
x
56
multiple-key access - R-trees
  • R-trees very successful for GIS
  • (along with z-ordering)
  • more details at advanced topics, later
  • even more details in 15-826

57
Indexing- overview
industry workhorse
  • ISAM and B-trees
  • hashing
  • Hashing vs B-trees
  • Indices in SQL
  • Advanced topics
  • dynamic hashing
  • multi-attribute indexing
Write a Comment
User Comments (0)
About PowerShow.com