
Title: PXML: A Probabilistic Semistructured Data Model and Algebra


1
PXML: A Probabilistic Semistructured Data Model
and Algebra
  • Edward Hung, Lise Getoor, V.S. Subrahmanian
  • University of Maryland, College Park
  • ICDE, Bangalore, India, Mar 2003

2
Outline
  • Motivating example
  • Semistructured data model
  • PXML data model
  • Semantics
  • Algebra
  • Probabilistic point query
  • Related work

3
Motivating Example
  • Bibliographic applications, citation index, e.g.
    Citeseer, DBLP
  • Automatic information extraction techniques →
    uncertainty (e.g., Fuhr, Buckley, Salton)
  • is it a reference?
  • a conference paper, a journal article, etc?
  • author? title? year?
  • different names of the same author?

4
Motivating Example
  • Example queries
  • We are interested only in the titles of books, not
    the publishers or locations
  • We are not sure whether a book called "XML
    Handbook" exists, but we want to consider the
    cases in which it does
  • We have two instances with data obtained from two
    sources, and we want to combine them
  • What is the probability that the book "XML
    Handbook" exists in the database?

5
Motivating Example
  • Semistructured data model
  • General hierarchical structure is known.
  • The schema is not fixed
  • Number of authors
  • Properties of authors
  • Our work stores uncertain information in
    probabilistic environments.

6
Semistructured Data Model
7
PXML Data Model
  • Uncertainty
  • Existence of sub-objects
  • Number of sub-objects
  • Identity of the sub-objects

8
PXML Data Model (Cardinality)
  • Example of cardinality

card(B1, author) = [1, 2]
Weak instance W = semistructured instance + card
9
PXML Data Model (Weak Instance)
  • Example of a weak instance W

card(R, book) = [2, 3]
card(B1, author) = [1, 2]
card(B2, author) = [2, 2]
card(B3, author) = [1, 1]
card(B3, title) = [1, 1]
10
PXML Data Model
  • Example of an instance compatible with W

card(R, book) = [2, 3]
card(B1, author) = [1, 2]
card(B2, author) = [2, 2]
card(B3, author) = [1, 1]
card(B3, title) = [1, 1]
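
The compatibility test behind this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary layout, the `LABEL` map (labels guessed from the object names), and both example instances are hypothetical, and only the cardinality intervals are checked, not any other condition the model may impose.

```python
# Cardinality constraints of the weak instance W: (object, label) -> (min, max)
CARD = {
    ("R", "book"): (2, 3),
    ("B1", "author"): (1, 2),
    ("B2", "author"): (2, 2),
    ("B3", "author"): (1, 1),
    ("B3", "title"): (1, 1),
}

# Label of each child object (an assumption based on the object names)
LABEL = {"B1": "book", "B2": "book", "B3": "book",
         "A1": "author", "A2": "author", "A3": "author", "T2": "title"}


def is_compatible(children, card=CARD, label=LABEL):
    """Check that every non-leaf object present in the instance has, for each
    label, a number of children inside the interval given by card."""
    for (obj, lab), (lo, hi) in card.items():
        if obj not in children:      # object does not occur in this instance
            continue
        count = sum(1 for c in children[obj] if label[c] == lab)
        if not lo <= count <= hi:
            return False
    return True


# Hypothetical instance: R has books B1 and B2; B3 is absent.
good = {"R": {"B1", "B2"}, "B1": {"A1"}, "B2": {"A2", "A3"}}
# A single book under R violates card(R, book) = [2, 3].
bad = {"R": {"B1"}, "B1": {"A1"}}

print(is_compatible(good), is_compatible(bad))
```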
11
  • D(W)
  • the set of all semistructured instances
    compatible with the weak instance W

12
card(B1, author) = [1, 2]
Potential child set of B1:
PC(B1) = { {A1, A2}, {A1}, {A2} }
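
As a rough sketch, the potential child sets for a single label can be enumerated from the cardinality interval. The helper name `potential_child_sets` is hypothetical, and the sketch assumes only one label (here `author`) per object; with several labels the sets would have to be combined across labels.

```python
from itertools import combinations


def potential_child_sets(children, lo, hi):
    """All subsets of `children` whose size lies in the interval [lo, hi]."""
    return [frozenset(c)
            for k in range(lo, hi + 1)
            for c in combinations(children, k)]


# card(B1, author) = [1, 2] with candidate authors A1 and A2
pc_b1 = potential_child_sets(["A1", "A2"], 1, 2)
for child_set in pc_b1:
    print(sorted(child_set))
```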
13
Probabilistic instance I = weak instance W +
local interpretation (p)
For non-leaf objects (e.g., B1), the local
interpretation p(B1) returns an object
probability function (OPF), which is a mapping
w: PC(B1) → [0, 1] such that w is a valid
probability distribution.
card(B1, author) = [1, 2]
p(B1)({A1, A2}) = 0.5
p(B1)({A1}) = 0.3
p(B1)({A2}) = 0.2
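
A minimal check that an OPF is a valid distribution over the potential child sets might look as follows. The function name `is_valid_opf` and the dictionary encoding are assumptions made for illustration.

```python
import math


def is_valid_opf(opf, potential_child_sets):
    """True if `opf` assigns probabilities in [0, 1] that sum to 1, and only
    to child sets that belong to the potential child sets."""
    return (set(opf) <= set(potential_child_sets)
            and all(0.0 <= p <= 1.0 for p in opf.values())
            and math.isclose(sum(opf.values()), 1.0))


pc_b1 = [frozenset({"A1", "A2"}), frozenset({"A1"}), frozenset({"A2"})]
opf_b1 = {
    frozenset({"A1", "A2"}): 0.5,
    frozenset({"A1"}): 0.3,
    frozenset({"A2"}): 0.2,
}

print(is_valid_opf(opf_b1, pc_b1))
```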
14
Probabilistic instance I = weak instance W +
local interpretation (p)
For leaf objects (e.g., T2), the local interpretation
p(T2) returns a value probability function
(VPF), which is a mapping w from the domain of
the type of T2 to [0, 1] such that w is a valid
probability distribution.
p(T2)("XML Black Book") = 0.2
p(T2)("XML Book") = 0.3
p(T2)("XML") = 0.5
15
Semantics (Local Interpretation)
  • Here the local interpretation assigns the
    probability to each possible set of children.
  • More independence assumptions are possible to
    make the representation more compact
  • e.g. independence between authors and titles.
  • e.g. all authors are indistinguishable (e.g.,
    no information about the names of the authors of a book).

16
Semantics (Global Interpretation)
  • Previously, probabilities were assigned to the
    actual children of each non-leaf object in a
    local manner.
  • Now we assign a probability to each
    compatible instance globally.

17
Semantics (Global Interpretation)
  • Interpretation
  • Global interpretation, P
  • a mapping from D(W) (the set of semistructured
    instances compatible with W) to [0, 1] s.t.
    $\sum_{S \in D(W)} P(S) = 1$

18
  • D(W)
  • the set of all semistructured instances
    compatible with the weak instance W

[Figure: the seven compatible instances, with probabilities 0.2, 0.15, 0.3, 0.05, 0.09, 0.03, 0.18 (sum 1)]
19
Semantics (Local → Global)
  • Given a semistructured instance S compatible with
    a weak instance W and a local interpretation p
    for W
  • $P_p(S) = \prod_{o \in S} p(o)(C_S(o))$
  • C_S(o) is the actual set of children of o in S
  • Theorem
  • P_p is a global interpretation for W

20
Semantics
  • Example: instance S1a
p(R)({B1, B2}) = 0.5
p(B1)({A1}) = 0.6
p(B2)({A2, A3}) = 1
  • P_p(S1a)
    = p(R)({B1, B2}) × p(B1)({A1}) × p(B2)({A2, A3})
    = 0.5 × 0.6 × 1 = 0.3
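
The product for S1a can be reproduced with a short sketch. The names `local`, `s1a`, and `instance_probability` are illustrative only, and leaf objects (and hence VPFs) are left out, since the slide's example multiplies OPF entries only.

```python
import math

# Local interpretation restricted to the entries used by S1a
local = {
    "R": {frozenset({"B1", "B2"}): 0.5},
    "B1": {frozenset({"A1"}): 0.6},
    "B2": {frozenset({"A2", "A3"}): 1.0},
}

# Actual children C_S(o) of every non-leaf object o in S1a
s1a = {"R": {"B1", "B2"}, "B1": {"A1"}, "B2": {"A2", "A3"}}


def instance_probability(local, children_in_s):
    """Multiply p(o)(C_S(o)) over the non-leaf objects o of the instance."""
    return math.prod(local[o][frozenset(c)] for o, c in children_in_s.items())


p_s1a = instance_probability(local, s1a)
print(round(p_s1a, 2))  # 0.3
```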
21
Semantics (Global → Local)
  • Theorem
  • Given a global interpretation P, if the
    probability of any potential child of an object o
    is independent of the non-descendants of o, then
    there exists a local interpretation p such that
    P_p = P
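
One plausible way to read the global-to-local direction is as conditioning: take p(o)(c) to be the probability that o has child set c, given that o occurs in the instance. The sketch below applies that reading to the seven-instance example whose probabilities are listed on slide 18. The edge structure is taken from the example instance I shown later in the deck, and the function name `local_opf` is hypothetical; this is not claimed to be the authors' exact operator.

```python
from collections import defaultdict

B2 = {"B2": ("A2", "A3")}
B3 = {"B3": ("A3", "T2")}

# Global interpretation: (parent -> children of each non-leaf object, probability)
WORLDS = [
    ({"R": ("B1", "B2"), "B1": ("A1",), **B2}, 0.30),
    ({"R": ("B1", "B2"), "B1": ("A2",), **B2}, 0.15),
    ({"R": ("B1", "B2"), "B1": ("A1", "A2"), **B2}, 0.05),
    ({"R": ("B1", "B3"), "B1": ("A1",), **B3}, 0.18),
    ({"R": ("B1", "B3"), "B1": ("A2",), **B3}, 0.09),
    ({"R": ("B1", "B3"), "B1": ("A1", "A2"), **B3}, 0.03),
    ({"R": ("B2", "B3"), **B2, **B3}, 0.20),
]


def local_opf(worlds, obj):
    """P(children of obj = c | obj occurs in the instance), for every c."""
    mass = defaultdict(float)
    for edges, p in worlds:
        if obj in edges:
            mass[edges[obj]] += p
    total = sum(mass.values())
    return {c: m / total for c, m in mass.items()}


opf_b1 = local_opf(WORLDS, "B1")
opf_r = local_opf(WORLDS, "R")
print({c: round(p, 2) for c, p in opf_b1.items()})
```

On this example the recovered values for B1 are 0.6, 0.3, and 0.1, which match the OPF of B1 in instance I.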

22
Semantics (Local ↔ Global)
  • We have defined operators to convert between
    local and global interpretations.

23
Semantics (Local ↔ Global)
  • Theorems (Reversibility)
  • The conversion from local to global
    interpretation is correct.
  • Under the conditional independence (of
    non-descendants) assumption, the conversion
    from global to local interpretation is correct.
  • The conversion between local and global
    interpretations is reversible.

24
Algebra
  • Operators
  • Projection
  • Selection
  • Cross-product
  • Path expression
  • o.l1.l2…ln
  • e.g., R.book.author
25
Algebra
  • Example of a probabilistic instance I

card(R, book) = [2, 3]
  p(R)({B1, B2}) = 0.5, p(R)({B1, B3}) = 0.3, p(R)({B2, B3}) = 0.2
card(B1, author) = [1, 2]
  p(B1)({A1}) = 0.6, p(B1)({A2}) = 0.3, p(B1)({A1, A2}) = 0.1
card(B2, author) = [2, 2]
  p(B2)({A2, A3}) = 1
card(B3, author) = [1, 1]
card(B3, title) = [1, 1]
  p(B3)({A3, T2}) = 1
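
To see how this local description expands into a global interpretation, the compatible instances can be enumerated recursively, multiplying one OPF entry per non-leaf object. The following is a sketch under assumptions: the `OPF` dictionary encoding and the function `worlds` are illustrative, leaf values (VPFs) are ignored, and children are treated as independent given their parent.

```python
from itertools import product

OPF = {
    "R":  {("B1", "B2"): 0.5, ("B1", "B3"): 0.3, ("B2", "B3"): 0.2},
    "B1": {("A1",): 0.6, ("A2",): 0.3, ("A1", "A2"): 0.1},
    "B2": {("A2", "A3"): 1.0},
    "B3": {("A3", "T2"): 1.0},
}


def worlds(obj):
    """Yield (edges, probability) for every sub-instance rooted at obj."""
    if obj not in OPF:               # leaf object: a single trivial choice
        yield {}, 1.0
        return
    for kids, p in OPF[obj].items():
        # pick one sub-instance for each child and combine the choices
        for combo in product(*(list(worlds(k)) for k in kids)):
            edges = {obj: kids}
            prob = p
            for sub_edges, sub_p in combo:
                edges.update(sub_edges)
                prob *= sub_p
            yield edges, prob


global_interp = list(worlds("R"))
for edges, prob in global_interp:
    print(round(prob, 2), edges)
```

The enumeration yields seven instances whose probabilities are 0.3, 0.15, 0.05, 0.18, 0.09, 0.03, and 0.2, the same values that label the instances in D(W) elsewhere in the deck.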
26
Algebra (Projection)
Semistructured Instance
  • Ancestor projection
  • e.g., we are interested only in authors, not in
    other details

27
  • D(W)
  • the set of all semistructured instances
    compatible with the weak instance W

[Figure: the seven compatible instances, with probabilities 0.2, 0.15, 0.3, 0.05, 0.09, 0.03, 0.18]
29
  • More efficient to compute locally
  • input probabilistic instance

card(R, book) = [2, 3]
  p(R)({B1, B2}) = 0.5, p(R)({B1, B3}) = 0.3, p(R)({B2, B3}) = 0.2
card(B1, author) = [1, 2]
  p(B1)({A1}) = 0.6, p(B1)({A2}) = 0.3, p(B1)({A1, A2}) = 0.1
card(B2, author) = [2, 2]
  p(B2)({A2, A3}) = 1
card(B3, author) = [1, 1]
card(B3, title) = [1, 1]
  p(B3)({A3, T2}) = 1
30
  • output probabilistic instance

card(R, book) = [2, 3]
  p(R)({B1, B2}) = 0.5, p(R)({B1, B3}) = 0.3, p(R)({B2, B3}) = 0.2
card(B1, author) = [1, 2]
  p(B1)({A1}) = 0.6, p(B1)({A2}) = 0.3, p(B1)({A1, A2}) = 0.1
card(B2, author) = [2, 2]
  p(B2)({A2, A3}) = 1
card(B3, author) = [1, 1]
  p(B3)({A3}) = 1
31
  • D(W)
  • the set of all semistructured instances
    compatible with the weak instance W

[Figure: the seven compatible instances, with probabilities 0.2, 0.15, 0.3, 0.05, 0.09, 0.03, 0.18]
33
  • D(W)
  • the set of all semistructured instances
    compatible with the weak instance W

0.3 + 0.15 + 0.05 = 0.5
0.18 + 0.09 + 0.03 + 0.2 = 0.5
34
  • input probabilistic instance

card(R, book) = [2, 3]
  p(R)({B1, B2}) = 0.5, p(R)({B1, B3}) = 0.3, p(R)({B2, B3}) = 0.2
card(B1, author) = [1, 2]
  p(B1)({A1}) = 0.6, p(B1)({A2}) = 0.3, p(B1)({A1, A2}) = 0.1
card(B2, author) = [2, 2]
  p(B2)({A2, A3}) = 1
card(B3, author) = [1, 1]
card(B3, title) = [1, 1]
  p(B3)({A3, T2}) = 1
35
  • output probabilistic instance

card(R, book) = [0, 1]
  p(R)({}) = 0.5 + (0.3 + 0.2) × (prob. that B3 has no child)
           = 0.5 + 0.5 × 0 = 0.5
  p(R)({B3}) = (0.3 + 0.2) × (prob. that B3 has a child)
             = 0.5 × 1 = 0.5
card(B3, title) = [1, 1]
  p(B3)({T2}) = 1
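
The 0.5 / 0.5 split can be cross-checked against the global semantics: project every compatible instance onto the path, then add up the probabilities of instances that project to the same result. The sketch below does this by brute force for the path R.book.title, which is an assumption inferred from the output shown on this slide. It is not the local algorithm the deck describes as more efficient, and the names `LABEL`, `WORLDS`, and `project` are illustrative.

```python
from collections import defaultdict

LABEL = {"B1": "book", "B2": "book", "B3": "book",
         "A1": "author", "A2": "author", "A3": "author", "T2": "title"}

B2 = {"B2": ("A2", "A3")}
B3 = {"B3": ("A3", "T2")}
WORLDS = [
    ({"R": ("B1", "B2"), "B1": ("A1",), **B2}, 0.30),
    ({"R": ("B1", "B2"), "B1": ("A2",), **B2}, 0.15),
    ({"R": ("B1", "B2"), "B1": ("A1", "A2"), **B2}, 0.05),
    ({"R": ("B1", "B3"), "B1": ("A1",), **B3}, 0.18),
    ({"R": ("B1", "B3"), "B1": ("A2",), **B3}, 0.09),
    ({"R": ("B1", "B3"), "B1": ("A1", "A2"), **B3}, 0.03),
    ({"R": ("B2", "B3"), **B2, **B3}, 0.20),
]


def project(edges, obj, labels):
    """Return (kept_edges, matched): the edges below obj that lie on a path
    spelling out `labels`, and whether any such path exists."""
    if not labels:
        return {}, True              # obj itself is selected by the path
    kept, kids = {}, []
    for child in edges.get(obj, ()):
        if LABEL[child] != labels[0]:
            continue
        sub, matched = project(edges, child, labels[1:])
        if matched:
            kids.append(child)
            kept.update(sub)
    if kids:
        kept[obj] = tuple(sorted(kids))
    return kept, bool(kids)


result = defaultdict(float)
for edges, p in WORLDS:
    kept, _ = project(edges, "R", ["book", "title"])
    result[frozenset(kept.items())] += p

for kept, p in result.items():
    print(round(p, 2), dict(kept))
```

An empty edge set stands for the root alone, which corresponds to p(R)({}) in the output instance.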
36
  • Experiments
  • a few seconds for 300K objects and 10M OPF
    entries
  • By measuring the slopes,
  • running time is approximately linear in the
    number of objects (selected objects and their
    ancestors)
  • time to update the OPF entries of an object o is
    sub-quadratic in the number of OPF entries

37
Algebra (Selection)
  • Selection
  • e.g., we are not sure whether T2 exists as
    the title of some book, but we want to
    keep the possible cases in which the title T2
    does exist
  • R.book.title = T2

38
Algebra (Selection)
  • Selection
  • object selection condition
  • e.g., we know that a particular author A1 exists
  • R.book.author = A1
  • value selection condition
  • e.g., R.book.title = "XML"

39
  • D(W)
  • the set of all semistructured instances
    compatible with the weak instance W

[Figure: the seven compatible instances, with probabilities 0.2, 0.15, 0.3, 0.05, 0.09, 0.03, 0.18]
41
  • D(W)
  • the set of all semistructured instances
    compatible with the weak instance W

0.2 / 0.5 = 0.4
0.09 / 0.5 = 0.18
0.03 / 0.5 = 0.06
0.18 / 0.5 = 0.36
42
  • input probabilistic instance

card(R, book) = [2, 3]
  p(R)({B1, B2}) = 0.5, p(R)({B1, B3}) = 0.3, p(R)({B2, B3}) = 0.2
card(B1, author) = [1, 2]
  p(B1)({A1}) = 0.6, p(B1)({A2}) = 0.3, p(B1)({A1, A2}) = 0.1
card(B2, author) = [2, 2]
  p(B2)({A2, A3}) = 1
card(B3, author) = [1, 1]
card(B3, title) = [1, 1]
  p(B3)({A3, T2}) = 1
43
  • output probabilistic instance

card(R, book) = [2, 3]
  p(R)({B1, B3}) = 0.3 / 0.5 = 0.6, p(R)({B2, B3}) = 0.2 / 0.5 = 0.4
card(B1, author) = [1, 2]
  p(B1)({A1}) = 0.6, p(B1)({A2}) = 0.3, p(B1)({A1, A2}) = 0.1
card(B2, author) = [2, 2]
  p(B2)({A2, A3}) = 1
card(B3, author) = [1, 1]
card(B3, title) = [1, 1]
  p(B3)({A3, T2}) = 1
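
Viewed globally, selection keeps the instances that satisfy the condition and renormalizes their probabilities. A small sketch for the condition R.book.title = T2 follows. The tuple layout of `WORLDS` and the helper `select` are assumptions, and the predicate relies on the fact that in this example T2 is always a child of B3, so the condition holds exactly when B3 is a book of R.

```python
from collections import defaultdict

# (children of R, children of B1 or None if B1 is absent, probability)
WORLDS = [
    (("B1", "B2"), ("A1",), 0.30),
    (("B1", "B2"), ("A2",), 0.15),
    (("B1", "B2"), ("A1", "A2"), 0.05),
    (("B1", "B3"), ("A1",), 0.18),
    (("B1", "B3"), ("A2",), 0.09),
    (("B1", "B3"), ("A1", "A2"), 0.03),
    (("B2", "B3"), None, 0.20),
]


def select(worlds, keep):
    """Keep the instances satisfying `keep` and renormalize to sum to 1."""
    kept = [w for w in worlds if keep(w)]
    total = sum(p for _, _, p in kept)
    return [(root, b1, p / total) for root, b1, p in kept]


selected = select(WORLDS, lambda w: "B3" in w[0])

# Marginal distribution over the children of R after selection
root_opf = defaultdict(float)
for root, _, p in selected:
    root_opf[root] += p

print({kids: round(p, 2) for kids, p in root_opf.items()})
```

The renormalized root probabilities come out as 0.6 for {B1, B3} and 0.4 for {B2, B3}, as in the output instance.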
44
Algebra (Cross Product, ×)
e.g., we want to combine two instances (of
information obtained from two sources) into one
I1: card(R, book) = [1, 1]
  p(R)({B1}) = 0.2, p(R)({B2}) = 0.8
I2: card(R, book) = [1, 1]
  p(R)({B3}) = 0.3, p(R)({B4}) = 0.7
I1 × I2: card(R, book) = [2, 2]
  p(R)({B1, B3}) = 0.2 × 0.3 = 0.06
  p(R)({B1, B4}) = 0.2 × 0.7 = 0.14
  p(R)({B2, B3}) = 0.8 × 0.3 = 0.24
  p(R)({B2, B4}) = 0.8 × 0.7 = 0.56
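
The root OPF of the combined instance can be sketched as a pairwise product. The helper `cross_product_root_opf` is a hypothetical name, and the sketch assumes the two instances share only the root and have disjoint child objects, as in this example.

```python
from itertools import product


def cross_product_root_opf(opf1, opf2):
    """Root OPF of I1 x I2: child sets are unioned, probabilities multiplied.
    Assumes the child objects of the two instances are disjoint."""
    return {c1 | c2: p1 * p2
            for (c1, p1), (c2, p2) in product(opf1.items(), opf2.items())}


opf1 = {frozenset({"B1"}): 0.2, frozenset({"B2"}): 0.8}
opf2 = {frozenset({"B3"}): 0.3, frozenset({"B4"}): 0.7}

combined = cross_product_root_opf(opf1, opf2)
for kids, p in combined.items():
    print(sorted(kids), round(p, 2))
```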
45
Probabilistic point query
  • returns the probability that a given object
    satisfies a given path expression

46
  • Example of a probabilistic instance I

card(R, book) = [2, 3]
  p(R)({B1, B2}) = 0.5, p(R)({B1, B3}) = 0.3, p(R)({B2, B3}) = 0.2
card(B1, author) = [1, 2]
  p(B1)({A1}) = 0.6, p(B1)({A2}) = 0.3, p(B1)({A1, A2}) = 0.1
card(B2, author) = [2, 2]
  p(B2)({A2, A3}) = 1
card(B3, author) = [1, 1]
card(B3, title) = [1, 1]
  p(B3)({A3, T2}) = 1
47
P(R.book.author = A1)
Probability that A1 is an author of some book?
(0.6 + 0.1) × (0.5 + 0.3) = 0.7 × 0.8 = 0.56
card(R, book) = [2, 3]
  p(R)({B1, B2}) = 0.5, p(R)({B1, B3}) = 0.3, p(R)({B2, B3}) = 0.2
card(B1, author) = [1, 2]
  p(B1)({A1}) = 0.6, p(B1)({A2}) = 0.3, p(B1)({A1, A2}) = 0.1
card(B2, author) = [2, 2]
  p(B2)({A2, A3}) = 1
card(B3, author) = [1, 1]
card(B3, title) = [1, 1]
  p(B3)({A3, T2}) = 1
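
The arithmetic on this slide multiplies, along the chain R → B1 → A1, the probability that each child belongs to its parent's child set. A sketch of that computation is given below. The function names are hypothetical, and the shortcut is valid here only because A1 can be reached through a single chain; an object reachable through several books would need a more careful treatment.

```python
OPF = {
    "R":  {("B1", "B2"): 0.5, ("B1", "B3"): 0.3, ("B2", "B3"): 0.2},
    "B1": {("A1",): 0.6, ("A2",): 0.3, ("A1", "A2"): 0.1},
}


def membership(opf, child):
    """Probability that `child` belongs to the child set drawn from `opf`."""
    return sum(p for kids, p in opf.items() if child in kids)


def chain_probability(opf, chain):
    """Probability that every parent -> child link along `chain` exists."""
    prob = 1.0
    for parent, child in zip(chain, chain[1:]):
        prob *= membership(opf[parent], child)
    return prob


p_a1 = chain_probability(OPF, ["R", "B1", "A1"])
print(round(p_a1, 2))  # 0.56
```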
48
Other Work Done
  • Implementation of a prototype
  • Experiment
  • Execution time is linear in the total number of
    ipf entries, i.e., the instance size
  • A paper accepted by ICDE

49
Related Work
  • Semistructured Probabilistic Objects (SPOs)
    (Dekhtyar, Goldsmith, Hawkes, in SSDBM, 2001)
  • SPOs express contexts (not random variables) in a
    semistructured manner
  • PIXML data model stores XML data AND
    probabilistic information.

50
Related Work
  • ProTDB (Nierman, Jagadish, in VLDB, 2002)
  • Independent probabilities assigned to each child
    vs arbitrary distributions over sets of children
  • Tree-structured
  • Our model theory provides two formal semantics
  • We propose a set of algebraic operators and a
    probabilistic point query

51
Summary
  • PXML data model
  • Semistructured instance
  • Weak instance (add cardinality)
  • Probabilistic instance (add opf)
  • Semantics
  • Local and Global Interpretation
  • Algebra
  • Projection, selection, cross product
  • Probabilistic point query

52
Related Work
  • Another paper, on an interval-probability version,
    in ICDT 2003
  • Semantics
  • Interpretations
  • Satisfaction
  • Consistency
  • Query and r-answer (objects satisfying the query
    with minimal probability no less than r)

53
Related Work
  • Algebras: TAX, SAL
  • TAX (Jagadish, Lakshmanan, Srivastava, 2001)
  • use pattern tree to extract subsets of nodes, one
    for each embedding of pattern tree.
  • fixed number of children
  • SAL (Beeri, Tzaban, 1999)
  • bind objects to variables
  • original structure is totally lost

54
Future Work
  • System implementation
  • Query optimization

55
Related Work
  • Bayesian net (Pearl, 1988)
  • random variables (probability of events)
  • ours: existence of children requires existence of
    parents