PXML: A Probabilistic Semistructured Data Model and Algebra

- Edward Hung, Lise Getoor, V.S. Subrahmanian
- University of Maryland, College Park
- ICDE, Bangalore, India, Mar 2003

Outline

- Motivating example
- Semistructured data model
- PXML data model
- Semantics
- Algebra
- Probabilistic point query
- Related work

Motivating Example

- Bibliographic applications and citation indexes, e.g., Citeseer, DBLP
- Automatic information extraction techniques → uncertainty (e.g., Fuhr, Buckley, Salton)
- Is it a reference?
- A conference paper, a journal article, etc.?
- Author? Title? Year?
- Different names of the same author?

Motivating Example

- Example queries
- We are only interested in the titles of books, not the publishers or locations.
- We are not sure whether a book called XML handbook exists, but we want to consider the cases in which it does.
- We have two instances with data obtained from two sources and we want to combine them.
- What is the probability that the book XML handbook exists in the database?

Motivating Example

- Semistructured data model
- General hierarchical structure is known.
- The schema is not fixed
- Number of authors
- Properties of authors
- Our work: store uncertain information in probabilistic environments.

Semistructured Data Model

PXML Data Model

- Uncertainty
- Existence of sub-objects
- Number of sub-objects
- Identity of the sub-objects

PXML Data Model (Cardinality)

- Example of cardinality
- card(B1, author) = [1, 2]

Weak Instance W = Semistructured Instance + card

PXML Data Model (Weak Instance)

- Example of a weak instance W
- card(R, book) = [2, 3]
- card(B1, author) = [1, 2]
- card(B2, author) = [2, 2]
- card(B3, author) = [1, 1]
- card(B3, title) = [1, 1]

PXML Data Model

- Example of an instance compatible with W
- card(R, book) = [2, 3]
- card(B1, author) = [1, 2]
- card(B2, author) = [2, 2]
- card(B3, author) = [1, 1]
- card(B3, title) = [1, 1]
- D(W): the set of all semistructured instances compatible with the weak instance W

- card(B1, author) = [1, 2]
- Potential child sets of B1: PC(B1) = { {A1, A2}, {A1}, {A2} }
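The step from a cardinality constraint to the potential child sets can be sketched in a few lines. This is a minimal illustration that assumes a single edge label per object; the function name `potential_child_sets` is hypothetical and not taken from the slides.

```python
from itertools import combinations

def potential_child_sets(children, lo, hi):
    """Return every subset of `children` whose size lies in [lo, hi]."""
    return [frozenset(c)
            for k in range(lo, hi + 1)
            for c in combinations(children, k)]

# B1 may have the author children A1 and A2, with card(B1, author) = [1, 2].
pc_b1 = potential_child_sets(["A1", "A2"], 1, 2)
print(pc_b1)
```

With several labels on the same object, the child sets for each label would have to be combined; that case is left out of this sketch.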

Probabilistic Instance I = Weak Instance W + local interpretation (p)

For non-leaf objects (e.g., B1), the local interpretation p(B1) returns an object probability function (OPF), which is a mapping w: PC(B1) → [0, 1] such that w is a valid probability distribution.

- card(B1, author) = [1, 2]
- p(B1)({A1, A2}) = 0.5
- p(B1)({A1}) = 0.3
- p(B1)({A2}) = 0.2
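One way to read "valid probability distribution" is as three checks: every key is a potential child set, every value lies in [0, 1], and the values sum to 1. The sketch below encodes the OPF of B1 as a dictionary; the representation and the helper name `is_valid_opf` are assumptions made for illustration.

```python
import math

def is_valid_opf(opf, potential_child_sets):
    """Check that `opf` is a probability distribution over the potential child sets."""
    in_domain = all(c in potential_child_sets for c in opf)
    in_range = all(0.0 <= p <= 1.0 for p in opf.values())
    return in_domain and in_range and math.isclose(sum(opf.values()), 1.0)

# PC(B1) and p(B1) as given on the slide.
pc_b1 = {frozenset({"A1", "A2"}), frozenset({"A1"}), frozenset({"A2"})}
opf_b1 = {
    frozenset({"A1", "A2"}): 0.5,
    frozenset({"A1"}): 0.3,
    frozenset({"A2"}): 0.2,
}
print(is_valid_opf(opf_b1, pc_b1))
```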

Probabilistic Instance I = Weak Instance W + local interpretation (p)

For leaf objects (e.g., T2), the local interpretation p(T2) returns a value probability function (VPF), which is a mapping w from the domain of the type of T2 to [0, 1] such that w is a valid probability distribution.

- p(T2)(XML Black Book) = 0.2
- p(T2)(XML Book) = 0.3
- p(T2)(XML) = 0.5

Semantics (Local Interpretation)

- Here the local interpretation assigns a probability to each possible set of children.
- Further independence assumptions can make the representation more compact:
- e.g., independence between authors and titles.
- e.g., all authors are indistinguishable (e.g., no information about the names of the authors of a book).

Semantics (Global Interpretation)

- Previously, probabilities were assigned to the actual children of each non-leaf object in a local manner.
- Now we assign a probability to each compatible instance globally.

Semantics (Global Interpretation)

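- Interpretation
- Global interpretation, P
- a mapping from D(W) (the set of semistructured instances compatible with W) to [0, 1] such that the probabilities of all instances in D(W) sum to 1
- D(W): the set of all semistructured instances compatible with the weak instance W

[Figure: the seven instances in D(W), labeled with probabilities 0.2, 0.15, 0.3, 0.05, 0.09, 0.03, 0.18]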

Semantics (Local → Global)

- Given a semistructured instance S compatible with a weak instance W and a local interpretation p for W:
- $P_p(S) = \prod_{o \in S} p(o)(C_S(o))$
- $C_S(o)$ is the actual set of children of o in S
- Theorem
- P_p is a global interpretation for W

Semantics

- Example: instance S1a
- p(R)({B1, B2}) = 0.5
- p(B1)({A1}) = 0.6
- p(B2)({A2, A3}) = 1
- P_p(S1a) = p(R)({B1, B2}) × p(B1)({A1}) × p(B2)({A2, A3}) = 0.5 × 0.6 × 1 = 0.3
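The S1a calculation can be reproduced with a short sketch. It assumes the local interpretation is stored as a dictionary from each non-leaf object to its OPF and that an instance is a dictionary from each non-leaf object to its actual child set; both encodings and the name `global_probability` are illustrative choices, not the authors' implementation.

```python
from math import prod

# Local interpretation p: object -> {child set -> probability}.
# Only the entries quoted on the slide are listed here.
local = {
    "R":  {frozenset({"B1", "B2"}): 0.5},
    "B1": {frozenset({"A1"}): 0.6},
    "B2": {frozenset({"A2", "A3"}): 1.0},
}

# S1a as a map from each non-leaf object o to its actual child set C_S(o).
s1a = {
    "R":  frozenset({"B1", "B2"}),
    "B1": frozenset({"A1"}),
    "B2": frozenset({"A2", "A3"}),
}

def global_probability(instance, local):
    """P_p(S): multiply p(o)(C_S(o)) over the non-leaf objects o of S."""
    return prod(local[o][children] for o, children in instance.items())

print(global_probability(s1a, local))
```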

Semantics (Global → Local)

- Theorem
- Given a global interpretation P, if the probability of any potential child of an object o is independent of the non-descendants of o, then there exists a local interpretation p such that P_p = P
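Under the independence assumption of the theorem, one plausible way to recover a local interpretation is to condition on each object being present: p(o)(c) is the total probability of the instances in which o has child set c, divided by the total probability of the instances containing o. The sketch below uses a small made-up global interpretation; the numbers and the name `local_from_global` are hypothetical.

```python
from collections import defaultdict

def local_from_global(global_interp):
    """p(o)(c) = P(o has child set c) / P(o occurs), read off a global table."""
    joint = defaultdict(lambda: defaultdict(float))
    for instance, prob in global_interp:
        for obj, children in instance.items():
            joint[obj][children] += prob
    return {obj: {c: p / sum(dist.values()) for c, p in dist.items()}
            for obj, dist in joint.items()}

# Hypothetical global interpretation: each instance maps its non-leaf objects
# to their child sets and is paired with its probability.
toy_global = [
    ({"R": frozenset({"B1"}), "B1": frozenset({"A1"})}, 0.3),
    ({"R": frozenset({"B1"}), "B1": frozenset({"A2"})}, 0.1),
    ({"R": frozenset()}, 0.6),
]
toy_local = local_from_global(toy_global)
print(toy_local["B1"])
```

If the independence condition does not hold, the resulting local interpretation will not in general reproduce the original global one.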

Semantics (Local ↔ Global)

- We have defined operators to convert between local and global interpretations.

Semantics (Local ↔ Global)

- Theorems (Reversibility)
- The conversion from local to global interpretation is correct.
- Under the conditional independence (of non-descendants) assumption, the conversion from global to local interpretation is correct.
- The conversion between local and global interpretations is reversible.

Algebra

- Operators
- Projection
- Selection
- Cross-product
- Path expression
- o.l1.l2.….ln, e.g., R.book.author
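A path expression can be evaluated on an ordinary (non-probabilistic) semistructured instance by following the labels one level at a time. The sketch below assumes an instance is stored as a dictionary from each object to its list of (label, child) edges; the instance shown and the name `eval_path` are illustrative.

```python
def eval_path(edges, root, labels):
    """Return the objects reachable from `root` by following `labels` in order."""
    frontier = {root}
    for label in labels:
        frontier = {child
                    for obj in frontier
                    for (lbl, child) in edges.get(obj, [])
                    if lbl == label}
    return frontier

# Hypothetical instance: object -> list of (label, child) edges.
instance = {
    "R":  [("book", "B1"), ("book", "B2")],
    "B1": [("author", "A1")],
    "B2": [("author", "A2"), ("author", "A3")],
}
authors = eval_path(instance, "R", ["book", "author"])
print(sorted(authors))
```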

Algebra

- Example of a probabilistic instance I
- card(R, book) = [2, 3]; p(R)({B1, B2}) = 0.5, p(R)({B1, B3}) = 0.3, p(R)({B2, B3}) = 0.2
- card(B1, author) = [1, 2]; p(B1)({A1}) = 0.6, p(B1)({A2}) = 0.3, p(B1)({A1, A2}) = 0.1
- card(B2, author) = [2, 2]; p(B2)({A2, A3}) = 1
- card(B3, author) = [1, 1]
- card(B3, title) = [1, 1]; p(B3)({A3, T2}) = 1

Algebra (Projection)

- Ancestor projection
- e.g., we are only interested in authors but not other details

[Figure: D(W), the set of all semistructured instances compatible with the weak instance W, with instance probabilities 0.2, 0.15, 0.3, 0.05, 0.09, 0.03, 0.18]

- More efficient to compute locally
- input probabilistic instance
- card(R, book) = [2, 3]; p(R)({B1, B2}) = 0.5, p(R)({B1, B3}) = 0.3, p(R)({B2, B3}) = 0.2
- card(B1, author) = [1, 2]; p(B1)({A1}) = 0.6, p(B1)({A2}) = 0.3, p(B1)({A1, A2}) = 0.1
- card(B2, author) = [2, 2]; p(B2)({A2, A3}) = 1
- card(B3, author) = [1, 1]
- card(B3, title) = [1, 1]; p(B3)({A3, T2}) = 1
- output probabilistic instance
- card(R, book) = [2, 3]; p(R)({B1, B2}) = 0.5, p(R)({B1, B3}) = 0.3, p(R)({B2, B3}) = 0.2
- card(B1, author) = [1, 2]; p(B1)({A1}) = 0.6, p(B1)({A2}) = 0.3, p(B1)({A1, A2}) = 0.1
- card(B2, author) = [2, 2]; p(B2)({A2, A3}) = 1
- card(B3, author) = [1, 1]; p(B3)({A3}) = 1

[Figure: D(W), the set of all semistructured instances compatible with the weak instance W, with instance probabilities 0.2, 0.15, 0.3, 0.05, 0.09, 0.03, 0.18]

- Probabilities of the instances after projection:
- 0.3 + 0.15 + 0.05 = 0.5
- 0.18 + 0.09 + 0.03 + 0.2 = 0.5

- input probabilistic instance
- card(R, book) = [2, 3]; p(R)({B1, B2}) = 0.5, p(R)({B1, B3}) = 0.3, p(R)({B2, B3}) = 0.2
- card(B1, author) = [1, 2]; p(B1)({A1}) = 0.6, p(B1)({A2}) = 0.3, p(B1)({A1, A2}) = 0.1
- card(B2, author) = [2, 2]; p(B2)({A2, A3}) = 1
- card(B3, author) = [1, 1]
- card(B3, title) = [1, 1]; p(B3)({A3, T2}) = 1
- output probabilistic instance
- card(R, book) = [0, 1]
- p(R)({}) = 0.5 + (0.3 + 0.2) × (prob. that B3 has no child) = 0.5 + 0.5 × 0 = 0.5
- p(R)({B3}) = (0.3 + 0.2) × (prob. that B3 has a child) = 0.5 × 1 = 0.5
- card(B3, title) = [1, 1]; p(B3)({T2}) = 1
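The two output probabilities can be cross-checked against the global semantics: enumerate every compatible instance, project each one onto R.book.title, and add up the probabilities of instances with the same projection. The sketch below does this by brute force; it is not the efficient local algorithm, and the encoding (`OPF`, `LABEL`, `instances`, `project`) is an assumption made for illustration, with one label per object.

```python
from collections import defaultdict

# Local interpretation of the example instance I: object -> {child set -> probability}.
OPF = {
    "R":  {frozenset({"B1", "B2"}): 0.5, frozenset({"B1", "B3"}): 0.3,
           frozenset({"B2", "B3"}): 0.2},
    "B1": {frozenset({"A1"}): 0.6, frozenset({"A2"}): 0.3, frozenset({"A1", "A2"}): 0.1},
    "B2": {frozenset({"A2", "A3"}): 1.0},
    "B3": {frozenset({"A3", "T2"}): 1.0},
}
# Label of the edge leading into each object.
LABEL = {"B1": "book", "B2": "book", "B3": "book",
         "A1": "author", "A2": "author", "A3": "author", "T2": "title"}

def instances(root):
    """Yield (instance, probability); an instance maps non-leaf objects to child sets."""
    def expand(pending, inst, prob):
        if not pending:
            yield inst, prob
            return
        obj, rest = pending[0], pending[1:]
        if obj not in OPF or obj in inst:      # leaf, or already expanded
            yield from expand(rest, inst, prob)
            return
        for children, p in OPF[obj].items():
            yield from expand(rest + sorted(children), {**inst, obj: children}, prob * p)
    yield from expand([root], {}, 1.0)

def project(inst, root, labels):
    """Edges lying on paths root.l1...ln that are fully matched in `inst`."""
    def walk(obj, depth):
        if depth == len(labels):
            return True, set()
        matched, edges = False, set()
        for child in inst.get(obj, ()):
            if LABEL[child] == labels[depth]:
                ok, sub = walk(child, depth + 1)
                if ok:
                    matched = True
                    edges |= {(obj, child)} | sub
        return matched, edges
    return frozenset(walk(root, 0)[1])

# Merge instances that project to the same result.
projected = defaultdict(float)
for inst, prob in instances("R"):
    projected[project(inst, "R", ["book", "title"])] += prob

for edges, prob in projected.items():
    print(sorted(edges), round(prob, 2))
```

The empty projection corresponds to p(R)({}) and the projection containing the edges R→B3 and B3→T2 corresponds to p(R)({B3}).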

- Experiments
- a few seconds for 300K objects and 10M OPF entries
- By measuring the slopes:
- running time is approximately linear in the number of objects (selected objects and their ancestors)
- time to update the OPF entries of an object o is sub-quadratic in the number of OPF entries

Algebra (Selection)

- Selection
- e.g., we are not sure whether T2 exists as a title of some book, but we want to keep the possible cases where the title T2 really exists
- R.book.title = T2

Algebra (Selection)

- Selection
- object selection condition
- e.g., we know that a particular author A1 exists
- R.book.author = A1
- value selection condition
- e.g., R.book.title = XML

[Figure: D(W), the set of all semistructured instances compatible with the weak instance W, with instance probabilities 0.2, 0.15, 0.3, 0.05, 0.09, 0.03, 0.18]

- Probabilities of the instances that satisfy the selection condition, renormalized by their total probability 0.5:
- 0.2 / 0.5 = 0.4
- 0.09 / 0.5 = 0.18
- 0.03 / 0.5 = 0.06
- 0.18 / 0.5 = 0.36

- input probabilistic instance
- card(R, book) = [2, 3]; p(R)({B1, B2}) = 0.5, p(R)({B1, B3}) = 0.3, p(R)({B2, B3}) = 0.2
- card(B1, author) = [1, 2]; p(B1)({A1}) = 0.6, p(B1)({A2}) = 0.3, p(B1)({A1, A2}) = 0.1
- card(B2, author) = [2, 2]; p(B2)({A2, A3}) = 1
- card(B3, author) = [1, 1]
- card(B3, title) = [1, 1]; p(B3)({A3, T2}) = 1
- output probabilistic instance
- card(R, book) = [2, 3]; p(R)({B1, B3}) = 0.3 / 0.5 = 0.6, p(R)({B2, B3}) = 0.2 / 0.5 = 0.4
- card(B1, author) = [1, 2]; p(B1)({A1}) = 0.6, p(B1)({A2}) = 0.3, p(B1)({A1, A2}) = 0.1
- card(B2, author) = [2, 2]; p(B2)({A2, A3}) = 1
- card(B3, author) = [1, 1]
- card(B3, title) = [1, 1]; p(B3)({A3, T2}) = 1
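Read globally, the selection R.book.title = T2 amounts to conditioning: keep the compatible instances in which T2 occurs and divide their probabilities by the total kept mass. The sketch below enumerates the instances of the example by hand (B2 and B3 each have a single child set, so only the root and B1 vary); the tuple encoding and the names `worlds` and `select` are illustrative assumptions.

```python
ROOT = {("B1", "B2"): 0.5, ("B1", "B3"): 0.3, ("B2", "B3"): 0.2}
B1 = {("A1",): 0.6, ("A2",): 0.3, ("A1", "A2"): 0.1}

def worlds():
    """Yield (books, authors of B1 or None, probability) for each compatible instance."""
    for books, p_root in ROOT.items():
        if "B1" in books:
            for authors, p_b1 in B1.items():
                yield books, authors, p_root * p_b1
        else:
            yield books, None, p_root   # B2 and B3 each have one child set, probability 1

def select(all_worlds, condition):
    """Keep the instances satisfying `condition` and renormalise their probabilities."""
    kept = [(books, authors, p) for books, authors, p in all_worlds
            if condition(books, authors)]
    total = sum(p for _, _, p in kept)
    return {(books, authors): p / total for books, authors, p in kept}

# T2 is a child of B3 only, so the condition holds exactly when B3 is present.
selected = select(worlds(), lambda books, authors: "B3" in books)
for world, prob in selected.items():
    print(world, round(prob, 2))
```

Summing the renormalised probabilities by root child set gives 0.6 for {B1, B3} and 0.4 for {B2, B3}, which agrees with the output instance above.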

Algebra (Cross product (×))

- e.g., we want to combine two instances (of information obtained from two sources) into one
- I1: card(R, book) = [1, 1]; p(R)({B1}) = 0.2, p(R)({B2}) = 0.8
- I2: card(R, book) = [1, 1]; p(R)({B3}) = 0.3, p(R)({B4}) = 0.7
- I1 × I2: card(R, book) = [2, 2]
- p(R)({B1, B3}) = 0.2 × 0.3 = 0.06
- p(R)({B1, B4}) = 0.2 × 0.7 = 0.14
- p(R)({B2, B3}) = 0.8 × 0.3 = 0.24
- p(R)({B2, B4}) = 0.8 × 0.7 = 0.56
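The OPF of the merged root can be sketched as a product over pairs of child sets, one from each input. This assumes the two instances are independent and that their child objects are disjoint; the function name `cross_product_root` is hypothetical.

```python
def cross_product_root(opf1, opf2):
    """Joint OPF of the merged root: union the child sets, multiply the probabilities."""
    # Assumes the child sets of the two inputs never overlap, so keys cannot collide.
    return {c1 | c2: p1 * p2
            for c1, p1 in opf1.items()
            for c2, p2 in opf2.items()}

opf_i1 = {frozenset({"B1"}): 0.2, frozenset({"B2"}): 0.8}
opf_i2 = {frozenset({"B3"}): 0.3, frozenset({"B4"}): 0.7}
opf_root = cross_product_root(opf_i1, opf_i2)
for children, prob in opf_root.items():
    print(sorted(children), round(prob, 2))
```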

Probabilistic point query

- returns the probability that a given object satisfies a given path expression
- Example of a probabilistic instance I
- card(R, book) = [2, 3]; p(R)({B1, B2}) = 0.5, p(R)({B1, B3}) = 0.3, p(R)({B2, B3}) = 0.2
- card(B1, author) = [1, 2]; p(B1)({A1}) = 0.6, p(B1)({A2}) = 0.3, p(B1)({A1, A2}) = 0.1
- card(B2, author) = [2, 2]; p(B2)({A2, A3}) = 1
- card(B3, author) = [1, 1]
- card(B3, title) = [1, 1]; p(B3)({A3, T2}) = 1
- P(R.book.author = A1): the probability that A1 is an author of some book
- (0.6 + 0.1) × (0.5 + 0.3) = 0.7 × 0.8 = 0.56
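The arithmetic for A1 can be sketched as a product of marginals along the only path that reaches it, R → B1 → A1. The helper name `child_probability` is an illustrative assumption.

```python
def child_probability(opf, child):
    """Marginal probability that `child` appears in the child set drawn from `opf`."""
    return sum(p for children, p in opf.items() if child in children)

opf_r = {frozenset({"B1", "B2"}): 0.5, frozenset({"B1", "B3"}): 0.3,
         frozenset({"B2", "B3"}): 0.2}
opf_b1 = {frozenset({"A1"}): 0.6, frozenset({"A2"}): 0.3, frozenset({"A1", "A2"}): 0.1}

# A1 is reachable only through B1, so a single product along R -> B1 -> A1 suffices.
p_a1 = child_probability(opf_r, "B1") * child_probability(opf_b1, "A1")
print(round(p_a1, 2))
```

An object reachable through several parents (A2 under both B1 and B2, for example) would need the separate paths combined, which this single-path sketch does not attempt.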


Other Work Done

- Implementation of a prototype
- Experiment
- Execution time is linear in the total number of ipf entries, i.e., the instance size
- A paper accepted by ICDE

Related Work

- Semistructured Probabilistic Objects (SPOs) (Dekhtyar, Goldsmith, Hawkes, in SSDBM, 2001)
- SPOs express contexts (not random variables) in a semistructured manner
- The PIXML data model stores XML data AND probabilistic information.

Related Work

- ProTDB (Nierman, Jagadish, in VLDB, 2002)
- Independent probabilities assigned to each child vs. arbitrary distributions over sets of children
- Tree-structured
- Our model theory provides two formal semantics
- We propose a set of algebraic operators and a probabilistic point query

Summary

- PXML data model
- Semistructured instance
- Weak instance (add cardinality)
- Probabilistic instance (add opf)
- Semantics
- Local and Global Interpretation
- Algebra
- Projection, selection, cross product
- Probabilistic point query

Related Work

- Another paper, on an interval probability version, in ICDT 2003
- Semantics
- Interpretations
- Satisfaction
- Consistency
- Query and r-answer (objects satisfying the query with minimal probability no less than r)

Related Work

- Algebras: TAX, SAL
- TAX (Jagadish, Lakshmanan, Srivastava, 2001)
- uses a pattern tree to extract subsets of nodes, one for each embedding of the pattern tree
- fixed number of children
- SAL (Beeri, Tzaban, 1999)
- binds objects to variables
- original structure is totally lost

Future Work

- System implementation
- Query optimization

Related Work

- Bayesian net (Pearl, 1988)
- random variables (probability of events)
- ours: existence of children requires existence of parents