A Fast Index for Semistructured Data - PowerPoint PPT Presentation

Provided by: ramsudhir
1
A Fast Index for Semistructured Data
  • Brian Cooper, Neal Sample, Michael Franklin,
    Gisli Hjaltason, Moshe Shadmon

Slides and Presentation by Ram Nandyala, 26 May 2005
2
Overview
  • Introduction
  • Index Fabric
  • Indexing XML with the Index Fabric
  • Experimental Results
  • Conclusion

3
Introduction
  • What is semistructured data?
  • data with an irregular or changing organization
  • does not conform to a particular schema
  • How to manage semistructured data?
  • RDBMS
  • Oracle, etc.
  • Specialized data managers
  • Lore, Tamino, XYZFind

4
Index Fabric
  • Encodes XML data paths as strings
  • Features
  • Does not impose a rigid structure or data model
    on the information being indexed
  • Can be used to represent the complex
    relationships that often exist among data
    elements.
  • Can augment simple search paths with descriptive
    and semantic information to enhance the richness
    of queries

5
Index Fabric (2)
  • PATRICIA trie
  • compressed trie: nodes with only one child are
    removed
  • indexes a large number of strings in a compact and
    efficient structure
  • instead of storing entire keys, indexes the
    differences between keys; hence, slow growth

6
Index Fabric (3)
  • PATRICIA trie
  • nodes labeled with depth - the character position
    in the key represented by the path
  • size of trie is independent of key lengths
  • aggressive (but lossy) compression
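
The lookup implied by these bullets can be sketched as follows. This is a minimal in-memory illustration, not the authors' implementation: the names `Node` and `lookup` are hypothetical, and leaves hold the full key directly, whereas a real index would presumably hold a pointer to the data. Because only the character positions named by the nodes are compared, a lookup can land on the wrong leaf, so the candidate is checked against the full key at the end.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Union


@dataclass
class Node:
    depth: int  # character position in the key that this node tests
    # Each edge is labeled with one character and leads to a node or a leaf key.
    edges: Dict[str, Union["Node", str]] = field(default_factory=dict)


def lookup(root: Node, key: str) -> Optional[str]:
    """Follow one labeled edge per node, then verify the candidate leaf."""
    node: Union[Node, str] = root
    while isinstance(node, Node):
        if node.depth >= len(key):
            return None  # key is too short to be tested at this position
        child = node.edges.get(key[node.depth])
        if child is None:
            return None  # no labeled edge matches
        node = child
    # Skipped positions were never compared (lossy compression), so verify.
    return node if node == key else None


# Hypothetical encoding of a small trie over the keys at, california, car, cat.
trie = Node(0, {
    "a": "at",
    "c": Node(2, {"t": "cat", "l": "california", "r": "car"}),
})

found = lookup(trie, "california")
rejected = lookup(trie, "citation")  # reaches the leaf "cat", fails verification
```

The number of nodes depends on how many keys there are and where they differ, not on how long the keys are, which is why the structure stays compact.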

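[Figure: PATRICIA trie over the keys at, california, car, cat; nodes are labeled with character positions 0 and 2, edges with the characters a, c, t, l, r.]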
7
Index Fabric (4)
  • But...
  • The tree can get very unbalanced

[Figure: PATRICIA trie over the keys at, car, cat, calexico, california, calimesa; the nodes labeled 0, 2, 3, 4 form a long chain on one side of the trie.]
8
Index Fabric (5)

  • Solution
  • Divide the trie into block-sized sub-tries
  • Index sub-tries with a second trie

[Figure: the trie over at, car, cat, calexico, california, calimesa divided into sub-tries at the prefixes ca, cal, and cali.]
9
Index Fabric (6)

[Figure: three-layer Index Fabric (Layer 2, Layer 1, Layer 0) over the keys at, car, cat, calexico, california, calimesa; blocks in adjacent layers are connected by direct links and far links.]
10
Index Fabric (7)
  • Searching Index Fabric
  • Start from the root node of the leftmost layer
  • If edge is far link, proceed horizontally to
    block in next layer
  • If no labeled edge matches, follow direct link to
    new block in next layer
  • Continue to rightmost layer until desired data is
    found (or not)
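
The search rules above can be sketched in a simplified in-memory model. Everything here is an assumption made for illustration: the class names `Node` and `Far`, the two-layer layout (the slides show three layers), and the choice to let a direct link point at the node with the same prefix in the next layer. Real blocks would be disk blocks rather than Python objects.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Union


@dataclass
class Node:
    depth: int  # character position tested at this node
    edges: Dict[str, Union["Node", "Far", str]] = field(default_factory=dict)
    direct: Optional["Node"] = None  # direct link: same prefix, next layer


@dataclass
class Far:
    target: Node  # far link: root of a block in the next layer


def search(root: Node, key: str) -> Optional[str]:
    node = root
    while True:
        label = key[node.depth] if node.depth < len(key) else None
        nxt = node.edges.get(label)
        if nxt is None:
            if node.direct is None:
                return None          # rightmost layer and nothing matches
            node = node.direct       # no labeled edge matches: direct link
        elif isinstance(nxt, Far):
            node = nxt.target        # far link: move to a block in the next layer
        elif isinstance(nxt, Node):
            node = nxt               # ordinary edge inside the current block
        else:
            return nxt if nxt == key else None  # leaf reached: must verify


# Layer 0 (hypothetical block layout): the full trie, split into two blocks.
cal_block = Node(3, {"e": "calexico",
                     "i": Node(4, {"f": "california", "m": "calimesa"})})
ca_node = Node(2, {"t": "cat", "r": "car"})
root_block = Node(0, {"a": "at", "c": ca_node})

# Layer 1: a small trie that indexes the layer 0 blocks.
layer1 = Node(0, {"c": Node(2, {"l": Far(cal_block)}, direct=ca_node)},
              direct=root_block)

hit = search(layer1, "california")
miss = search(layer1, "citation")  # reaches the leaf "cat", fails verification
```

In this model each search touches at most one block per layer, since every far link or direct link moves one layer to the right and never back.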

11
Index Fabric (8)

[Figure: searching for "california" in the three-layer Index Fabric (Layer 2, Layer 1, Layer 0; direct links and far links).]
12
Index Fabric (8)

[Figure: searching for "california" in the three-layer Index Fabric, continued.]
13
Index Fabric (8)

[Figure: searching for "california" in the three-layer Index Fabric, continued.]
14
Index Fabric (8)

[Figure: searching for "california" in the three-layer Index Fabric, final step. Annotation: "Must verify!"]
15
Index Fabric (8)

[Figure: searching for "citation" in the three-layer Index Fabric. Annotation: "Not a match!!"]
16
Index Fabric (9)
  • Searching (continued)
  • Ideally, only one block is accessed per layer

17
Index Fabric (10)
  • Insertion

[Figure: Index Fabric before the insertion, holding the keys at, california, car, cat.]
18
Index Fabric (11)
  • Insertion

[Figure: Index Fabric after inserting calimesa; a new node labeled 4 is added, and the structure now holds the keys at, car, cat, california, calimesa.]
19
Indexing XML with the Index Fabric
  • Encoding Scheme
  • Designators
  • Designator Dictionary
  • Raw Paths
  • Refined Paths

20
Indexing XML with the Index Fabric (2)
  • Designators
  • Special characters or strings used to encode data
    paths
  • Unique designator assigned to each tag and
    attribute appearing in XML document
  • e.g. I = <invoice>, B = <buyer>, N = <name>

<invoice><buyer><name>ABC Corp</name></buyer></invoice>
is encoded as "I B N ABC Corp"
21
Indexing XML with the Index Fabric (3)
  • Designator Dictionary
  • mapping between tag names and their assigned
    designators
  • when XML document is parsed, tag names matched to
    designators
  • tag names from queries translated to designators
    to form search key

22
Indexing XML with the Index Fabric (4)
  • Raw Paths
  • Index the hierarchical structure of XML by
    encoding root-to-leaf paths as strings
  • assume no a priori knowledge of queries or
    structure
  • Prefix-encoding scheme

23
Indexing XML with the Index Fabric (5)
Doc. 1
<invoice>
  <buyer>
    <name>Oracle Inc</name>
    <phone>555-1212</phone>
  </buyer>
  <seller>
    <name>IBM</name>
  </seller>
  <item>
    <count>4</count>
    <name>nail</name>
  </item>
</invoice>

Doc. 2
<invoice>
  <buyer>
    <name>ABC Corp</name>
    <address>1 Industrial Way</address>
  </buyer>
  <seller>
    <name>Acme Inc</name>
    <address>2 Acme Rd.</address>
  </seller>
  <item count=3>saw</item>
  <item count=2>drill</item>
</invoice>

Designators
<invoice> = I, <buyer> = B, <seller> = S, <name> = N,
<address> = A, <phone> = P, <item> = T, <count> = C, count = C

Encoded paths, Doc. 1
IBNOracle Inc, IBP555-1212, ISNIBM, ITC4, ITNnail

Encoded paths, Doc. 2
IBNABC Corp, IBA1 Industrial Way, ISNAcme Inc, ISA2 Acme Rd.,
ITC3, ITsaw, ITC2, ITdrill
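
The encoding of Doc. 1 can be reproduced with a short sketch. It assumes Python's standard `xml.etree.ElementTree` parser as a stand-in for whatever parser the original system used, and the function names `raw_paths` and `search_key` are hypothetical. Attributes are left out, so only Doc. 1 is encoded here. The second function illustrates how a simple path expression from a query might be translated through the same designator dictionary.

```python
import xml.etree.ElementTree as ET
from typing import List

# Designator dictionary for the element tags used in the example documents.
DESIGNATORS = {
    "invoice": "I", "buyer": "B", "seller": "S", "name": "N",
    "address": "A", "phone": "P", "item": "T", "count": "C",
}


def raw_paths(xml_text: str) -> List[str]:
    """Encode every root-to-leaf path as designators followed by the leaf text."""
    keys: List[str] = []

    def walk(elem: ET.Element, prefix: str) -> None:
        prefix += DESIGNATORS[elem.tag]
        children = list(elem)
        if not children:  # leaf element: append its text value
            keys.append(prefix + (elem.text or "").strip())
        for child in children:
            walk(child, prefix)

    walk(ET.fromstring(xml_text), "")
    return keys


def search_key(path: str, value: str) -> str:
    """Translate a simple path such as /invoice/buyer/name into a search key."""
    tags = path.strip("/").split("/")
    return "".join(DESIGNATORS[tag] for tag in tags) + value


DOC1 = """<invoice>
  <buyer><name>Oracle Inc</name><phone>555-1212</phone></buyer>
  <seller><name>IBM</name></seller>
  <item><count>4</count><name>nail</name></item>
</invoice>"""

keys = raw_paths(DOC1)
# keys == ["IBNOracle Inc", "IBP555-1212", "ISNIBM", "ITC4", "ITNnail"]
query_key = search_key("/invoice/buyer/name", "Oracle Inc")
```

Because the designators form a prefix of every key, all paths that share a tag sequence sit next to each other in the trie.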
24
Indexing XML with the Index Fabric (6)
25
Indexing XML with the Index Fabric (7)
  • Raw Paths (continued)
  • New documents can be added any time

26
Indexing XML with the Index Fabric (8)
  • Refined Paths
  • Specialized paths that optimize frequently
    occurring queries
  • Encoded similar to raw paths
  • Stored in same index as raw paths

27
Indexing XML with the Index Fabric (9)
  • Refined Paths (continued)
  • Example: find invoices where Acme Inc. sold to
    ABC Corp.
  • A designator, say Z, is assigned to this path and
    encoded together with the values from the query
  • The Index Fabric key is "Z Acme Inc. ABC Corp."
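
A minimal sketch of how such refined-path keys could be generated is shown below. It is an illustration under stated assumptions: a Python dict stands in for the Index Fabric, the names `refined_key` and `build_refined_index` are hypothetical, and joining the parts with single spaces simply mirrors the key shown on the slide (a real encoding would need an unambiguous separator).

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

REFINED_DESIGNATOR = "Z"  # designator assigned to the "seller sold to buyer" path


def refined_key(seller: str, buyer: str) -> str:
    """Build the key for 'invoices where <seller> sold to <buyer>'."""
    return f"{REFINED_DESIGNATOR} {seller} {buyer}"


def build_refined_index(docs: Dict[int, str]) -> Dict[str, List[int]]:
    """Map each refined-path key to the ids of the invoices that match it."""
    index: Dict[str, List[int]] = {}  # stand-in for the Index Fabric
    for doc_id, text in docs.items():
        invoice = ET.fromstring(text)
        for seller in invoice.findall("seller/name"):
            for buyer in invoice.findall("buyer/name"):
                key = refined_key(seller.text, buyer.text)
                index.setdefault(key, []).append(doc_id)
    return index


docs = {
    1: "<invoice><buyer><name>Oracle Inc</name></buyer>"
       "<seller><name>IBM</name></seller></invoice>",
    2: "<invoice><buyer><name>ABC Corp.</name></buyer>"
       "<seller><name>Acme Inc.</name></seller></invoice>",
}

index = build_refined_index(docs)
matches = index.get(refined_key("Acme Inc.", "ABC Corp."), [])
```

The query then becomes a single key lookup instead of two separate path lookups whose results have to be joined.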

28
Indexing XML with the Index Fabric (10)
  • Improving Query Processing
  • Raw Paths
  • Simple path expressions
  • General path expressions
  • Refined Paths
  • Query processor recognizes the path expression as
    a refined path and translates it into a search key

29
Experimental Results
  • Index Fabric
  • Native RDBMS index (B-tree)
  • Edge mapping
  • STORED system
  • Tested using data from DBLP archive

30
Experimental Results (2)
  • Basic Edge Mapping
  • treats XML as set of nodes and edges
  • two tables
  • roots(id, label)
  • edges(parentid, childid, label)
  • key-compressed B-tree indexes
  • roots(id), roots(label)
  • edges(parentid), edges(childid), edges(label)
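
The two tables and their indexes can be sketched with SQLite, used here only as a convenient stand-in: it is not the RDBMS from the experiments and its B-trees are not key-compressed. The column types, index names, and sample rows are assumptions, and the storage of text values is omitted because the slides do not describe it.

```python
import sqlite3

SCHEMA = """
CREATE TABLE roots(id INTEGER, label TEXT);
CREATE TABLE edges(parentid INTEGER, childid INTEGER, label TEXT);
CREATE INDEX roots_id       ON roots(id);
CREATE INDEX roots_label    ON roots(label);
CREATE INDEX edges_parentid ON edges(parentid);
CREATE INDEX edges_childid  ON edges(childid);
CREATE INDEX edges_label    ON edges(label);
"""

# One join per step of the path /invoice/buyer/name.
PATH_QUERY = """
SELECT n.childid
FROM roots r
JOIN edges b ON b.parentid = r.id      AND b.label = 'buyer'
JOIN edges n ON n.parentid = b.childid AND n.label = 'name'
WHERE r.label = 'invoice'
"""


def build_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    # Hypothetical node ids for one invoice with a buyer and a seller.
    conn.executemany("INSERT INTO roots VALUES (?, ?)", [(1, "invoice")])
    conn.executemany(
        "INSERT INTO edges VALUES (?, ?, ?)",
        [(1, 2, "buyer"), (2, 3, "name"), (1, 4, "seller"), (4, 5, "name")],
    )
    return conn


rows = build_db().execute(PATH_QUERY).fetchall()
```

Each additional path step adds another self-join on `edges`, whereas the Index Fabric answers the same path with a single key lookup.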

31
Experimental Results (3)
  • STORED system
  • partial schema extracted from XML, using data
    mining
  • nonconforming data stored in overflow buckets, in
    a manner similar to edge mapping
  • key-compressed B-tree indexes

32
Experimental Results (4)
  • Queries

33
Experimental Results (5)
  • I/O Blocks

34
Experimental Results (6)
  • Time - Seconds

35
Experimental Results (7)
  • Query B: find conference paper by author

36
Experimental Results (8)
  • Query D: find publications by co-authors

37
Conclusion
  • Fast Index for efficiently accessing XML and
    other semistructured data
  • Outperforms existing mechanisms for handling
    semistructured data, sometimes by an order of
    magnitude