PXML: A Probabilistic Semistructured Data Model and Algebra



1
PXML: A Probabilistic Semistructured Data Model and Algebra
  • Edward Hung, Lise Getoor, V.S. Subrahmanian
  • University of Maryland, College Park
  • Database Seminar, CSIS, University of Hong Kong,
    June 2003

2
Outline
  • Motivating example
  • PXML data model
  • Semantics
  • Algebra
  • projection, selection, cross product
  • Probabilistic point query
  • Related work

3
Motivating Example
  • Bibliographic applications, citation index, e.g.
    Citeseer, DBLP
  • automatic information extraction techniques →
    uncertainty (e.g., Fuhr, Buckley, Salton)
  • a conference paper, a journal article, a book,
    etc.?
  • author? title? year?
  • different names of the same author?
  • identical/similar names for different authors?

4
Example of Semistructured Data Instance
5
Motivating Example
  • Semistructured data model
  • General hierarchical structure is known.
  • The schema is not fixed
  • Number of authors
  • Properties of authors
  • Our goal is to store and manipulate uncertain
    information probabilistically.

6
Motivating Example
  • DBLP
  • 78 authors with last name Hung
  • Many have identical initials
  • The first three in the list
  • A. C. Hung
  • Andy C. Hung
  • Angelo C. Hung
  • Uncertain when extracting a paper

7
Extracted citation: Multidimensional Rotations for Quantization, by
A.C. Hung, T.H.Y. Meng, Data Compression Conference 1994 ...
Who is A.C. Hung?
(i) A. C. Hung
(ii) Andy C. Hung
(iii) Angelo C. Hung
8
6 papers existing in database
New citation: Multidimensional Rotations for Quantization, by
A.C. Hung, T.H.Y. Meng, Data Compression Conference 1994 ...
[Figure: the six existing papers. Author lists shown: A.C. Hung, P.M. Reddy, P.J. Hammer; Andy C. Hung, Teresa H. Y. Meng; Angelo C. Hung, Francis C. Wang; Angelo C. Hung, Miroslaw Malek. Venues and years shown: 1989; Data Compression Conference 1993 (three times); 1985; 1981.]
9
Extracted citation: Multidimensional Rotations for Quantization, by
A.C. Hung, T.H.Y. Meng, Data Compression Conference 1994 ...
Who is A.C. Hung?
(i) A. C. Hung: Prob 0.1
(ii) Andy C. Hung: Prob 0.8
(iii) Angelo C. Hung: Prob 0.1
10
What information can we store? And how?
Weak Instance W = Semistructured Instance + card
B1
  title: T1 (value: Multidimensional Rotations for Quantization)
  author: A1 (value: T. H. Y. Meng)
  author: A2 (value: A. C. Hung)
  author: A3 (value: Andy C. Hung)
  author: A4 (value: Angelo C. Hung)
Cardinality constraints: card(B1, title) = [1,1], card(B1, author) = [2,2]
There are exactly ONE title and TWO authors.
The set of all potential child sets of B1:
{T1,A1,A2}, {T1,A1,A3}, {T1,A1,A4}, {T1,A2,A3}, {T1,A2,A4}, {T1,A3,A4}
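The potential child sets listed on this slide can be reproduced mechanically from the cardinality constraints. The following is a minimal sketch, not the authors' implementation; the dictionaries `lch` and `card` and the function `potential_child_sets` are hypothetical names chosen for illustration.

```python
from itertools import combinations, product

# Hypothetical encoding of the weak instance for B1:
# potential children per label, and the interval [min, max] per label.
lch = {"title": ["T1"], "author": ["A1", "A2", "A3", "A4"]}
card = {"title": (1, 1), "author": (2, 2)}

def potential_child_sets(lch, card):
    """Return every child set that satisfies all cardinality constraints."""
    per_label = []
    for label, children in lch.items():
        lo, hi = card[label]
        # All subsets of this label's children whose size lies in [lo, hi].
        choices = [frozenset(c)
                   for k in range(lo, hi + 1)
                   for c in combinations(children, k)]
        per_label.append(choices)
    # Pick one subset per label and union them into a single child set.
    return [frozenset().union(*combo) for combo in product(*per_label)]

pcs = potential_child_sets(lch, card)
for s in pcs:
    print(sorted(s))
```

With one title and two of four authors, the function yields the six child sets shown on the slide.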
11
  • D(W)
  • the set of all semistructured instances
    compatible with the weak instance W

Weak instance W
B1
  title: T1 (value: Multidimensional Rotations for Quantization)
  author: A1 (value: T. H. Y. Meng)
  author: A2 (value: A. C. Hung)
  author: A3 (value: Andy C. Hung)
  author: A4 (value: Angelo C. Hung)
card(B1, title) = [1,1], card(B1, author) = [2,2]
12
What information can we store? And how?
Prob. Instance I = Weak Instance + p
Local interpretation p assigns to every non-leaf
object an OPF (object probability function, a
conditional prob. distribution over its potential
child sets given that it exists):
p(B1)({T1,A1,A2}) = 0.1, p(B1)({T1,A1,A3}) = 0.8, p(B1)({T1,A1,A4}) = 0.1
Cardinality constraints: card(B1, title) = [1,1], card(B1, author) = [2,2]
B1
  title: T1 (value: Multidimensional Rotations for Quantization)
  author: A1 (value: T. H. Y. Meng)
  author: A2 (value: A. C. Hung)
  author: A3 (value: Andy C. Hung)
  author: A4 (value: Angelo C. Hung)
13
S1: B1 with title T1 and authors A1, A2
S2: B1 with title T1 and authors A1, A3
S3: B1 with title T1 and authors A1, A4
P(S1) = 0.1, P(S2) = 0.8, P(S3) = 0.1
P(Si) = 0 for the remaining compatible instances Si
  • Local interpretation for efficient storage and
    computation in practice.
  • Global interpretation assigns probabilities to
    each compatible instance globally, which is more
    intuitive.

14
Semantics (Local ↔ Global)
  • We have defined operators to convert between
    local and global interpretations.
  • Theorems (Correctness and Reversibility)
  • The conversion from local to global
    interpretation is correct.
  • Under the conditional independence (of
    non-descendants) assumption, the conversion
    from global to local interpretation is correct.
  • The conversion between local and global
    interpretations is reversible.

15
Example queries
  • we are interested only in the titles of books, not
    the publishers or locations
  • we are not sure whether a book called XML
    handbook exists, but we want to
    consider the cases in which it exists
  • we have two instances with data obtained from two
    sources and we want to combine them
  • what is the probability that the book XML
    handbook exists in the database?

16
Algebra
  • Operators
  • Projection
  • Selection
  • Cross-product
  • Path expression
  • o.l1.l2. ... .ln, e.g., R.book.author
17
Algebra
  • Example of a probabilistic instance I

card(R, book) = [2,3]: p(R)({B1,B2}) = 0.5, p(R)({B1,B3}) = 0.3, p(R)({B2,B3}) = 0.2
card(B1, author) = [1,2]: p(B1)({A1}) = 0.6, p(B1)({A2}) = 0.3, p(B1)({A1,A2}) = 0.1
card(B2, author) = [2,2]: p(B2)({A2,A3}) = 1
card(B3, author) = [1,1], card(B3, title) = [1,1]: p(B3)({A3,T2}) = 1
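To see how the local interpretation of this instance induces a global one, the compatible instances can be enumerated and the chosen OPF entries multiplied. The sketch below is illustrative only and is not the conversion operator from the talk: the names `opf`, `global_interpretation` and `worlds` are hypothetical, and it assumes every non-leaf object has a single parent (true for this example).

```python
# OPFs of the example instance: object -> {child set: probability}.
opf = {
    "R": {frozenset({"B1", "B2"}): 0.5,
          frozenset({"B1", "B3"}): 0.3,
          frozenset({"B2", "B3"}): 0.2},
    "B1": {frozenset({"A1"}): 0.6,
           frozenset({"A2"}): 0.3,
           frozenset({"A1", "A2"}): 0.1},
    "B2": {frozenset({"A2", "A3"}): 1.0},
    "B3": {frozenset({"A3", "T2"}): 1.0},
}

def global_interpretation(opf, root="R"):
    """List (instance, probability) pairs; an instance maps each
    existing non-leaf object to its chosen child set."""
    def expand(pending):
        # pending: non-leaf objects whose child set is not chosen yet
        if not pending:
            yield {}, 1.0
            return
        head, rest = pending[0], pending[1:]
        for children, p in opf[head].items():
            # Only non-leaf children (those with an OPF) need expanding.
            new = [c for c in sorted(children) if c in opf]
            for inst, q in expand(rest + new):
                yield {head: children, **inst}, p * q
    return list(expand([root]))

worlds = global_interpretation(opf)
for inst, prob in worlds:
    print(sorted(inst["R"]), round(prob, 2))
```

Each probability is the product of one OPF entry per object that exists in the instance, so the values sum to 1 over all compatible instances.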
18
Algebra (Projection)
Semistructured Instance
  • Ancestor projection
  • e.g., we are interested only in authors, not in
    other details

19
  • D(W)
  • the set of all semistructured instances
    compatible with I

Probabilities of the seven compatible instances: 0.2, 0.15, 0.3, 0.05, 0.09, 0.03, 0.18
20
  • D(W)
  • the set of all semistructured instances
    compatible with the weak instance W

Probabilities of the seven compatible instances: 0.2, 0.15, 0.3, 0.05, 0.09, 0.03, 0.18
21
  • More efficient to compute locally
  • input probabilistic instance

card(R, book) = [2,3]: p(R)({B1,B2}) = 0.5, p(R)({B1,B3}) = 0.3, p(R)({B2,B3}) = 0.2
card(B1, author) = [1,2]: p(B1)({A1}) = 0.6, p(B1)({A2}) = 0.3, p(B1)({A1,A2}) = 0.1
card(B2, author) = [2,2]: p(B2)({A2,A3}) = 1
card(B3, author) = [1,1], card(B3, title) = [1,1]: p(B3)({A3,T2}) = 1
22
  • output probabilistic instance

card(R, book) = [2,3]: p(R)({B1,B2}) = 0.5, p(R)({B1,B3}) = 0.3, p(R)({B2,B3}) = 0.2
card(B1, author) = [1,2]: p(B1)({A1}) = 0.6, p(B1)({A2}) = 0.3, p(B1)({A1,A2}) = 0.1
card(B2, author) = [2,2]: p(B2)({A2,A3}) = 1
card(B3, author) = [1,1]: p(B3)({A3}) = 1
23
  • D(W)
  • the set of all semistructured instances
    compatible with I

Probabilities of the seven compatible instances: 0.2, 0.15, 0.3, 0.05, 0.09, 0.03, 0.18
24
  • D(W)
  • the set of all semistructured instances
    compatible with the weak instance W

Probabilities of the seven compatible instances: 0.2, 0.15, 0.3, 0.05, 0.09, 0.03, 0.18
25
  • D(W)
  • the set of all semistructured instances
    compatible with the weak instance W

0.3 + 0.15 + 0.05 = 0.5
0.18 + 0.09 + 0.03 + 0.2 = 0.5
26
  • input probabilistic instance

card(R, book) = [2,3]: p(R)({B1,B2}) = 0.5, p(R)({B1,B3}) = 0.3, p(R)({B2,B3}) = 0.2
card(B1, author) = [1,2]: p(B1)({A1}) = 0.6, p(B1)({A2}) = 0.3, p(B1)({A1,A2}) = 0.1
card(B2, author) = [2,2]: p(B2)({A2,A3}) = 1
card(B3, author) = [1,1], card(B3, title) = [1,1]: p(B3)({A3,T2}) = 1
27
  • output probabilistic instance

card(R, book) = [0,1]
p(R)({}) = 0.5 + (0.3 + 0.2) x (prob. that B3 has no child) = 0.5 + 0.5 x 0 = 0.5
p(R)({B3}) = (0.3 + 0.2) x (prob. that B3 has a child) = 0.5 x 1 = 0.5
card(B3, title) = [1,1]: p(B3)({T2}) = 1
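The local update of the root's OPF on this slide can be sketched for the two-level example as follows. This is an illustration under stated assumptions, not the algorithm from the talk: it assumes the projection keeps only books that have a child with the wanted label, and the names `p_root`, `p_book`, `label`, `keep_prob` and `project_root` are hypothetical.

```python
from collections import defaultdict
from itertools import product

p_root = {frozenset({"B1", "B2"}): 0.5,
          frozenset({"B1", "B3"}): 0.3,
          frozenset({"B2", "B3"}): 0.2}
p_book = {
    "B1": {frozenset({"A1"}): 0.6, frozenset({"A2"}): 0.3,
           frozenset({"A1", "A2"}): 0.1},
    "B2": {frozenset({"A2", "A3"}): 1.0},
    "B3": {frozenset({"A3", "T2"}): 1.0},
}
label = {"A1": "author", "A2": "author", "A3": "author", "T2": "title"}

def keep_prob(book, wanted):
    """Probability that `book` has at least one child labelled `wanted`."""
    return sum(p for cs, p in p_book[book].items()
               if any(label[c] == wanted for c in cs))

def project_root(wanted):
    """New OPF of the root after dropping books without a `wanted` child."""
    out = defaultdict(float)
    for books, p in p_root.items():
        ordered = sorted(books)
        # Each book is kept or dropped independently, given the root's choice.
        for flags in product([True, False], repeat=len(ordered)):
            q = p
            for b, kept in zip(ordered, flags):
                k = keep_prob(b, wanted)
                q *= k if kept else 1.0 - k
            if q > 0:
                kept_set = frozenset(b for b, kept in zip(ordered, flags) if kept)
                out[kept_set] += q
    return dict(out)

projected = project_root("title")
for child_set, prob in projected.items():
    print(sorted(child_set), round(prob, 2))
```

Calling `project_root("author")` instead leaves the root's OPF unchanged, because every book has an author with probability 1.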
28
  • Experiments
  • a few seconds for 300K objects and 10M OPF
    entries
  • By measuring the slopes,
  • running time is approximately linear in the
    number of objects (selected objects and their
    ancestors)
  • time to update the OPF entries of an object o is
    sub-quadratic in the number of OPF entries

29
Algebra (Selection)
  • Selection
  • e.g., we are not sure whether T2 exists as
    the title of some book, but we want to
    keep the possible cases where the title T2 really
    exists
  • R.book.title = T2

30
  • D(W)
  • the set of all semistructured instances
    compatible with I

Probabilities of the seven compatible instances: 0.2, 0.15, 0.3, 0.05, 0.09, 0.03, 0.18
31
  • D(W)
  • the set of all semistructured instances
    compatible with the weak instance W

Probabilities of the seven compatible instances: 0.2, 0.15, 0.3, 0.05, 0.09, 0.03, 0.18
32
  • D(W)
  • the set of all semistructured instances
    compatible with the weak instance W

0.2 / 0.5 = 0.4
0.09 / 0.5 = 0.18
0.03 / 0.5 = 0.06
0.18 / 0.5 = 0.36
33
  • input probabilistic instance

card(R, book) = [2,3]: p(R)({B1,B2}) = 0.5, p(R)({B1,B3}) = 0.3, p(R)({B2,B3}) = 0.2
card(B1, author) = [1,2]: p(B1)({A1}) = 0.6, p(B1)({A2}) = 0.3, p(B1)({A1,A2}) = 0.1
card(B2, author) = [2,2]: p(B2)({A2,A3}) = 1
card(B3, author) = [1,1], card(B3, title) = [1,1]: p(B3)({A3,T2}) = 1
34
  • output probabilistic instance

card(R, book) = [2,3]: p(R)({B1,B3}) = 0.3 / 0.5 = 0.6, p(R)({B2,B3}) = 0.2 / 0.5 = 0.4
card(B1, author) = [1,2]: p(B1)({A1}) = 0.6, p(B1)({A2}) = 0.3, p(B1)({A1,A2}) = 0.1
card(B2, author) = [2,2]: p(B2)({A2,A3}) = 1
card(B3, author) = [1,1], card(B3, title) = [1,1]: p(B3)({A3,T2}) = 1
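The renormalisation of the root's OPF in this selection can be sketched as a conditioning step. The sketch is illustrative and covers only this example, where B3 has title T2 with probability 1, so keeping the cases in which T2 exists amounts to keeping the child sets of R that contain B3. The names `p_root` and `condition_on_child` are hypothetical.

```python
p_root = {frozenset({"B1", "B2"}): 0.5,
          frozenset({"B1", "B3"}): 0.3,
          frozenset({"B2", "B3"}): 0.2}

def condition_on_child(opf_o, required):
    """Keep the child sets containing `required` and renormalise."""
    kept = {cs: p for cs, p in opf_o.items() if required in cs}
    total = sum(kept.values())
    if total == 0:
        raise ValueError("the selection condition has probability zero")
    return {cs: p / total for cs, p in kept.items()}

selected = condition_on_child(p_root, "B3")
for child_set, prob in selected.items():
    print(sorted(child_set), round(prob, 2))
```

The normalising constant here is 0.3 + 0.2 = 0.5, the same divisor that appears on the slide.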
35
Algebra (Cross product (x))
e.g., we want to combine two instances (of
information obtained from two sources) into one
I1: card(R, book) = [1,1]; p(R)({B1}) = 0.2, p(R)({B2}) = 0.8
I2: card(R, book) = [1,1]; p(R)({B3}) = 0.3, p(R)({B4}) = 0.7
I1 x I2: card(R, book) = [2,2]
p(R)({B1,B3}) = 0.2 x 0.3 = 0.06
p(R)({B1,B4}) = 0.2 x 0.7 = 0.14
p(R)({B2,B3}) = 0.8 x 0.3 = 0.24
p(R)({B2,B4}) = 0.8 x 0.7 = 0.56
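The root OPF of the cross product on this slide can be computed by pairing every child set from one instance with every child set from the other. This is a minimal sketch that assumes the two instances share only the root and have disjoint object identifiers; the names `p1`, `p2` and `cross_product_root` are hypothetical.

```python
p1 = {frozenset({"B1"}): 0.2, frozenset({"B2"}): 0.8}  # root OPF of I1
p2 = {frozenset({"B3"}): 0.3, frozenset({"B4"}): 0.7}  # root OPF of I2

def cross_product_root(p1, p2):
    """Root OPF of I1 x I2: union the child sets, multiply the probabilities."""
    return {c1 | c2: a * b
            for c1, a in p1.items()
            for c2, b in p2.items()}

combined = cross_product_root(p1, p2)
for child_set, prob in combined.items():
    print(sorted(child_set), round(prob, 2))
```

Multiplying the probabilities treats the two sources as independent, which is what the products on the slide express.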
36
Probabilistic point query
  • returns the probability that a given object
    satisfies a given path expression

37
  • Example of a probabilistic instance I

card(R, book) = [2,3]: p(R)({B1,B2}) = 0.5, p(R)({B1,B3}) = 0.3, p(R)({B2,B3}) = 0.2
card(B1, author) = [1,2]: p(B1)({A1}) = 0.6, p(B1)({A2}) = 0.3, p(B1)({A1,A2}) = 0.1
card(B2, author) = [2,2]: p(B2)({A2,A3}) = 1
card(B3, author) = [1,1], card(B3, title) = [1,1]: p(B3)({A3,T2}) = 1
38
P(R.book.author = A1)
probability that A1 is an author of some book?
(0.6 + 0.1) x (0.5 + 0.3) = 0.7 x 0.8 = 0.56
card(R, book) = [2,3]: p(R)({B1,B2}) = 0.5, p(R)({B1,B3}) = 0.3, p(R)({B2,B3}) = 0.2
card(B1, author) = [1,2]: p(B1)({A1}) = 0.6, p(B1)({A2}) = 0.3, p(B1)({A1,A2}) = 0.1
card(B2, author) = [2,2]: p(B2)({A2,A3}) = 1
card(B3, author) = [1,1], card(B3, title) = [1,1]: p(B3)({A3,T2}) = 1
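The point query on this slide multiplies, along the chain R, B1, A1, the probability that each object appears in its parent's child set. The sketch below is illustrative and assumes A1 can be reached through a single chain of parents only (B1 is the only book with A1 as a potential child); the names `opf`, `contains_prob` and `chain_prob` are hypothetical.

```python
opf = {
    "R": {frozenset({"B1", "B2"}): 0.5,
          frozenset({"B1", "B3"}): 0.3,
          frozenset({"B2", "B3"}): 0.2},
    "B1": {frozenset({"A1"}): 0.6,
           frozenset({"A2"}): 0.3,
           frozenset({"A1", "A2"}): 0.1},
}

def contains_prob(opf_o, child):
    """Probability that `child` is in the child set drawn from this OPF."""
    return sum(p for cs, p in opf_o.items() if child in cs)

def chain_prob(opf, chain):
    """Multiply the containment probabilities along a root-to-object chain."""
    prob = 1.0
    for parent, child in zip(chain, chain[1:]):
        prob *= contains_prob(opf[parent], child)
    return prob

answer = chain_prob(opf, ["R", "B1", "A1"])
print(round(answer, 2))
```

If an object could be reached through several parents, the chains would have to be combined rather than evaluated one at a time; that case is not covered here.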
39
Related Work
  • Semistructured Probabilistic Objects (SPOs)
    (Dekhtyar, Goldsmith, Hawkes, in SSDBM, 2001)
  • SPOs express contexts (not random variables) in a
    semistructured manner
  • PXML data model stores XML data AND probabilistic
    information.

40
Related Work
  • ProTDB (Nierman, Jagadish, in VLDB, 2002)
  • Independent probabilities assigned to each child
    vs arbitrary distributions over sets of children
  • Tree-structured
  • Our model theory provides two formal semantics
  • We propose a set of algebraic operators and
    a probabilistic point query

41
Summary
  • PXML data model
  • Weak instance (+ cardinality)
  • Probabilistic instance (+ local interpretation)
  • Semantics
  • Local and Global Interpretations
  • Algebra
  • Projection, selection, cross product
  • Probabilistic point query