Loading...

PPT – SFTree: An Efficient and Flexible Structure for Estimating Selectivity of Simple Path Expressions wi PowerPoint presentation | free to view - id: 29bd3-MjkyY

The Adobe Flash plugin is needed to view this content

SF-Tree An Efficient and Flexible Structure for

Estimating Selectivity of Simple Path Expressions

with Accuracy Guarantee

- Ho Wai Shing

Contents

- Motivation
- Problem Definition
- Perfect Binary SF-Tree
- Variations of SF-Tree
- Experiments
- Conclusion

Motivation

Motivation

- XML is becoming more and more popular
- e.g., product catalogs, invoices, documentations,

etc - We need to extract useful information out of them
- Queries, usually in form of XQuery or XPath, are

important

Motivation

Motivation

- e.g.,
- FOR i IN document()//buyer/name
- j IN document()//seller/name
- WHERE i/text()j/text()
- RETURN i

Motivation

- To evaluate this query, different evaluation

plans are available - FOR i IN document()//buyer/name
- j IN document()//seller/name
- WHERE i/text()j/text()
- RETURN i

Motivation

- To select the best query plan, we need the

selectivity of simple path expressions (SPEs) - To retrieve selectivity counts, we need a

structure to store suitable statistics of the

original XML document collection

Motivation

- We may use path indices to store this

SPE-to-selectivity mapping - e.g., DataGuide, path tree, 1-index etc.
- disadvantage designed for absolute path

expressions (APEs) and inefficient for simple

path expressions (SPEs)

Motivation

Motivation

- To increase efficiency, we may use suffix trie to

store the mapping - advantage really quick, no need to scan the

whole index tree - disadvantage very large, especially when the

data tree is complex

Motivation

Motivation

- These structures are not flexible
- wont change with frequently asked queries
- cant make trade-offs between space and speed
- We can tolerate some errors in selectivity

retrieval and it is good to know a bound on the

error

Motivation

- Our goal to build a new structure for estimating

the selectivity of SPEs which, - is faster than the path indices
- requires less space than suffix tries
- provides flexible trade-off between space and

time - provides accuracy guarantee on the estimated

selectivity

Problem Definition

Problem Definition

- XML document
- is a text document that logically represents a

directed edge-labeled graph called data graph - consists of Elements, Attributes and other

information items

Problem Definition

Problem Definition

- Data Paths
- a set of information items (e1, e2, , en)
- each ei is directly nested within ei-1 for 2 ? i

? n - i.e., a path in a data graph (not necessarily

starts from root)

Problem Definition

- Simple Path Expression (SPE)
- asks about the structure of XML documents
- has a form //t1/t2/ /tn where ti are tags
- returns all information items (en) which is the

end of a data path (e1, e2, , en) that satisfies

the SPE

Problem Definition

- Satisfy a data path (e1, e2,
, en) satisfies an

SPE (//t1/t2/ /tn) if - every ei has a tag ti
- Selectivity (or counts, selectivity counts)
- selectivity of an SPE (//t1/t2/
/tn) is the

number of information items ens that are

returned by the SPE

Problem Definition

- e.g.,
- //buyer/name, //name, //invoice/buyer/name are

all SPEs - selectivity counts of //name is 2, //products is

1, //invoice is 1

Problem Definition

Problem Definition

- The Problem
- given a simple path expression p and a collection

D of XML documents, - find the estimated selectivity s1 of p in D while

s1 should be as close to the actual selectivity s

of p in D as possible

SF-Tree

SF-Tree

- Basic Idea
- divide all possible SPEs in an XML document

collection into distinct groups - each SPE in a group has the same selectivity (or

similar selectivity) - the task of finding the selectivity of an SPE now

becomes the task of finding which selectivity

group this SPE belongs to

SF-Tree

- e.g.,

SF-Tree

- e.g.,

SF-Tree

- In each group, we use a signature file to store

all the paths that belong to this group - This reduces the storage requirement, improves

the accessing speed but induces errors to the

structure

Signature Files

- A traditional technique to improve the efficiency

of keyword-based information retrieval - Each keyword is hashed into m bit positions of a

signature file F - For each keyword in the document, the

corresponding m bits in F are set

Signature Files

- e.g.,
- F has 10 bits, m 3
- //name is hashed to bits 2, 3, 8
- //buyer is hashed to bits 2, 4, 9

Signature Files

- To check the existence of a query keyword k in F,

k is hashed and we check whether all the m bits

are set or not - If any of the m bits are unset, k is not in F
- If all m bits are set, k is very likely to be in

F

Signature Files

- e.g., (cont)
- if a query keyword //seller is hashed to bits 2,

8, 9, - F will say //seller is in it, but it is not true
- bits 2, 8, 9 are in fact set by other keywords
- a false drop occurred

Signature Files

- Previous analysis showed that in optimal case,
- false drop rate is (1/2)m
- F m D / ln(2) where D is the set of all

keywords in the document - In our case, we treat an SPE in a group as a

keyword and a group as a document

SF-Tree

- There are many groups
- for 16-bit selectivity counts, there are 216

groups if each group has a distinct selectivity

count - To get the selectivity, we need to find which

group this SPE belongs to - Linear scan of all groups is not feasible
- slow
- error prone

SF-Tree

- organize the groups into a tree form
- called an SF-Tree (stands for Signature File

Tree) - Rationale create representative groups over

several groups, and if an SPE is not in a

representative group, it will never be in any of

descendant groups of this representative group.

SF-Tree

SF-Tree

SF-Tree

- Basic selectivity retrieval algorithm
- 1. start from the root,
- 2. check which child contains the SPE p
- 3. descend to that child, repeat step 2, 3 until

a leaf is reached - 4. return the selectivity associated with that

leaf

Perfect Binary SF-Tree

- A very straightforward way to build

representative nodes is to build the nodes

according to the most significant bits of the

selectivity count - i.e., all the descendants has the same m.s.b. in

their selectivity counts - A perfect binary SF-Tree is thus formed

Perfect Binary SF-Tree

- advantage
- easy to build
- easy to analyze
- disadvantage
- less flexible
- cannot tackle FAQ
- not that space efficient

Perfect Binary SF-Tree

- properties (proofs are skipped)
- low error rate
- let d be depth of the SF-Tree, a be the depth

difference between the false drop leaf, correct

leaf and their least common ancestor - for negative queries, false drop rate is Rneg(d)
- for positive queries, max error size 2a,

Rpos(d, a)

Perfect Binary SF-Tree

- high accessing speed
- assuming no false drop,
- Tneg(d) 2 signatures
- Tpos(d) 2d signatures
- it doesnt depend on the complexity of the XML

documents

Perfect Binary SF-Tree

- each false drop induces at most 2 more signature

checks - expected number of signatures to be checked for

positive query - C

Perfect Binary SF-Tree

- storage
- each SPE must appear at one leaf and all the

ancestors of that leaf - each SPE contributes m/ln2 bits to each signature

that contains it - space requirement (m/ln2) ?v (vdv) where v is

a leaf node

Perfect Binary SF-Tree

- properties -- summary
- negligible error rate (if m 10, d 16, error

rate for a negative query is 1/2144!) - error rate decreases as error size increases for

positive queries - error rate and accessing time are independent of

the complexity of the XML documents

Variations of SF-Tree

Variations of SF-Tree

- drawbacks of a perfect binary SF-Tree
- every SPE has the same depth in SF-Tree, cannot

optimize for FAQ - too many layers, we can tolerate a higher error

rate for smaller space requirement

Variations of SF-Tree

- Huffman SF-Tree
- Shannon-Fano SF-Tree
- Cropped SF-Tree

Huffman SF-Tree

- build a Huffman tree on the groups of selectivity
- the weight of each group can be
- A. the number of SPEs in that group
- B. the probability of a random query will fall

into that group

Huffman SF-Tree

Huffman SF-Tree

- advantage
- optimal in space required over all binary

SF-Trees if the weight is (A). - optimal in expected number of signatures to be

checked if the weight is (B). - disadvantage
- error rate is independent of error size

Shannon-Fano SF-Tree

- build a Shannon-Fano tree over the selectivity

groups - the order of the leaves are not changed
- similar depth as in Huffman tree
- error rate now decreases as error size increases

Shannon-Fano SF-Tree

- We observed that the distribution of SPE over all

selectivity is very skewed - most of the SPE has a very small selectivity
- the distribution is similar to a GP with ratio

smaller than 1. - The use of Huffman or Shannon-Fano tree much

reduces the depth of SF-Tree

Shannon-Fano SF-Tree

Cropped SF-Tree

- Cropped SF-Tree is an easy trade-off between

access time, space and accuracy - two types of cropped SF-Tree
- head-cropped SF-Tree (the top h levels of an

SF-Tree is removed) - tail-cropped SF-Tree (the bottom t levels of an

SF-Tree is removed)

Cropped SF-Tree

Head-cropped SF-Tree

- height of the SF-Tree decreases
- space required is smaller
- accessing time increases because we have to check

all the children of the root in the first step - accuracy decreases a little bit because the

maximum possible value of a decreases.

Tail-cropped SF-Tree

- height of the SF-Tree decreases
- space required is smaller
- accessing time is shorter (we can traverse fewer

levels) - accuracy is lowered (a leaf now represents a

range of selectivity, i.e., the quantization step

size of selectivity is larger)

Experiments

Experiments

- Data Sets

Experiments

- Query Workloads
- positive SPEs randomly pick 2000 SPEs from all

the SPEs in the XML document collection. Each

SPE has equal probability of being picked - negative SPEs randomly concatenate tags in the

XML document collection, and then remove those

with non-zero selectivity until 2000 SPEs are

generated

Experiments

Comparison to Competitors

Performance of SF-Tree

Performance of SF-Tree

Performance of SF-Tree

Conclusions

Conclusions

- A new structure, SF-Tree, is introduced
- SF-Tree is much quicker and path indices like

path trees and uses much less space than suffix

trie - SF-Tree provides means to make a trade-off

between space, time and accuracy

Conclusions

- Although SF-Tree returns only estimated

selectivity, but accuracy is guaranteed - Neg queries false drop rate 1/2d(m-1),
- Pos queries false drop rate 1/2ma-a1 max

error size 1/2a

Discussions

- SF-Tree is a new approach to approximate some

"string-to-number" structures - It may be applicable to other aspects such as

approximating a multi-dimensional histogram

Questions?

References

- A. Aboulnaga, A. R. Alameldeen, and J. F.

Naughton. Estimating the selectivity of XML path

expressions for internet scale applications. In

VLDB, 591--600, 2001. - C. Faloutsos and S. Christodoulakis. Signature

files An access method for documents and its

analytical performance evaluation. ACM TOIS,

2(4)267--288, 1984. - R. Goldman and J. Widom. Data Guides Enabling

query formulation and optimization in

semistructured databases. In VLDB, pp. 436--445,

1997. - L. Lim, M. wang, S. Padmanabhan, J. S. Vitter,

and R.Parr. XPathLearner an on-line self-tuning

markov histogram for XML path selectivity

estimation. In VLDB 2002. - T. Milo, D. Suciu. Index structures for path

expressions. In ICDT, 1999. - N. Polyzotis and M. Garofalakis. Statistical

synopses for graph-structured XML databases. In

SIGMOD, 2002.

References

- N. Polyzotis and M. Garofalakis. Structure and

value synopses for XML data graphs. In VLDB,

2002. - W3C. Extensible markup language (XML) 1.0.

http//www.w3.org/TR/1998/REC-xml-19980210