Searching and Integrating Information on the Web

Seminar 4: Ranking Queries and Data Privacy. Professor Chen Li, UC Irvine.


1
Searching and Integrating Information on the Web

Seminar 4 Ranking Queries and Data Privacy
Professor Chen Li
UC Irvine

2
Outline and readings

Ranking Queries
Fagin, R., Combining Fuzzy Information from Multiple Systems, PODS 1996.
Fagin et al., Optimal Aggregation Algorithms for Middleware, PODS 2001.
Data privacy
Database-as-service
Hakan Hacigumus, Bala Iyer, Chen Li, and Sharad Mehrotra, Executing SQL over Encrypted Data in the Database-Service-Provider Model, SIGMOD 2002.
XML Data publishing
Xiaochun Yang and Chen Li, Secure XML Publishing without Information Leakage in the Presence of Data Inference. To appear in VLDB '04.

3
Outline

Ranking Queries
Data privacy
XML Data publishing
Database-as-service

4
Top-k queries

Finding the k multi-attribute tuples with the highest scores
Scoring function: aggregates the scores on individual attributes, e.g., w1·A1 + ... + wn·An, where wi is the weight for attribute Ai.
Monotone aggregation functions: if tuple A has a grade at least as high as tuple B on each attribute, then A's overall grade is at least as high as B's.
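
As a small illustration of the weighted-sum scoring function and the monotonicity property, here is a minimal sketch. The weights and grades are made-up integers, and the function name `weighted_score` is chosen here for illustration only.

```python
def weighted_score(grades, weights):
    """Weighted sum w1*A1 + ... + wn*An of per-attribute grades."""
    return sum(w * g for w, g in zip(weights, grades))


# Hypothetical weights and grades (integers keep the arithmetic exact).
weights = (5, 3, 2)
tuple_a = (9, 8, 7)   # at least as high as tuple_b on every attribute
tuple_b = (6, 8, 5)

score_a = weighted_score(tuple_a, weights)
score_b = weighted_score(tuple_b, weights)

# Monotonicity: dominance on every attribute implies an overall grade
# that is at least as high. min() is another monotone aggregation.
dominates = all(x >= y for x, y in zip(tuple_a, tuple_b))
assert dominates and score_a >= score_b
assert min(tuple_a) >= min(tuple_b)
```

Any aggregation with this property (weighted sum with non-negative weights, min, max, average) can be used with the algorithms in the following slides.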

5
Applications

Multimedia databases
Web search queries
Restaurants
Houses
Cars

6
Modes of Data Access (Fagin)

The underlying middleware (e.g., search engines, Garlic, QBIC) supports 2 modes:
1. Sorted access
- Each attribute Ai (column) forms a list Li sorted by the score on Ai.
- The list is read one entry at a time, from the top.
2. Random access
- Ask the system for the grade of any given object.
Goal: minimize the total cost of getting the top-k results.

[Figure: three sorted lists, one per attribute (year, mileage, price); the top entries shown are (b, e, f, ...), (a, c, e, ...) and (a, d, e, ...).]
7
FA: Fagin's Algorithm (PODS '96)

Do sorted access in parallel to each of the m sorted lists Li. Wait until there is a set H of at least k objects such that each of these objects has been seen in each of the m lists.
For each object R that has been seen, do random access as needed to each of the lists Li to find the i-th field xi of R.
Compute the aggregate grade of each object that has been seen, and return the k objects with the highest grades.
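
The three steps above can be sketched as follows. This is a minimal in-memory sketch, not the authors' implementation: the function name `fagin_fa`, the list layout, and the integer grades are assumptions made here, and a dictionary lookup stands in for random access. The example data is hypothetical but mirrors the list prefixes (b, e, f), (a, c, e), (a, d, e) used in the slides.

```python
def fagin_fa(lists, k, agg=sum):
    """Sketch of FA.

    lists: m non-empty lists of (object, grade) pairs, each sorted by
           grade, highest first, and each containing every object once.
    Returns (top-k as [(object, overall grade)], sorted-access depth).
    """
    m = len(lists)
    grade_of = [dict(lst) for lst in lists]   # stands in for random access
    seen_in = {}                              # object -> lists it was seen in
    depth = 0
    # Step 1: parallel sorted access until k objects were seen in all m lists.
    while depth < len(lists[0]):
        for i, lst in enumerate(lists):
            obj, _ = lst[depth]
            seen_in.setdefault(obj, set()).add(i)
        depth += 1
        if sum(len(s) == m for s in seen_in.values()) >= k:
            break
    # Step 2: random access for the grades of every seen object.
    # Step 3: aggregate and keep the k best.
    scored = [(obj, agg([g[obj] for g in grade_of])) for obj in seen_in]
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:k], depth


# Hypothetical grades for objects a..f.
list1 = [("b", 9), ("e", 8), ("f", 7), ("a", 6), ("c", 5), ("d", 4)]
list2 = [("a", 9), ("c", 8), ("e", 7), ("b", 3), ("d", 2), ("f", 1)]
list3 = [("a", 9), ("d", 8), ("e", 7), ("b", 3), ("c", 2), ("f", 1)]

fa_top, fa_depth = fagin_fa([list1, list2, list3], k=1)
```

With this data, e is the first object seen in all three lists (at depth 3), but the top-1 answer is a, which had been seen in only two lists when the sorted access stopped.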

8
Example
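[Figure: the three sorted lists, one per attribute (year, mileage, price), with top entries (b, e, f, ...), (a, c, e, ...) and (a, d, e, ...), and a cut-off line below the entries retrieved so far.]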

Suppose k = 1. Given the three partial lists retrieved so far, e appears in all of them. We can say that the top-1 tuple must be in {a, b, c, d, e, f}.
Reason: since the function is monotone, tuple e blocks all tuples below the cut-off line, since they can only have an overall grade no higher than e's.
The algorithm does random access for the other 5 tuples to get their missing grades (all of e's grades are already known), and picks the top 1.
Notice that we cannot say e must be the top 1, since other tuples (e.g., a) may still have a higher overall score.
Minor point: one possible improvement is to skip f, which can never be better than e.

9
General case
[Figure: the three sorted lists (year, mileage, price), each marked with k, and a cut-off line.]

Once k tuples have appeared in all the partial lists, halt.
Reason: these k tuples block all the tuples below the cut-off line, which cannot be better than these k tuples.
Do random access for the retrieved tuples to get their overall grades, and find the top-k.

10
FA's Properties

Correctly finds the top-k results for monotone aggregation functions.
Cost for a database with N objects and m sorted lists: $O(N^{(m-1)/m} k^{1/m})$ with arbitrarily high probability, when the orderings of the sorted lists are independent.
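
To get a feel for this bound, the expression inside the O(...) can be evaluated for sample sizes. This is only arithmetic on the formula, ignoring the hidden constant; the function name and the sample numbers are chosen here for illustration.

```python
def fa_cost_estimate(n_objects, m_lists, k):
    """N^((m-1)/m) * k^(1/m), ignoring the constant hidden by O(...)."""
    return n_objects ** ((m_lists - 1) / m_lists) * k ** (1 / m_lists)


# One million objects, two sorted lists, top-10 query:
# the estimate is sqrt(1,000,000) * sqrt(10), a few thousand accesses.
estimate = fa_cost_estimate(1_000_000, 2, 10)
```

The estimate grows quickly with m: for the same N and k, more lists push the exponent (m-1)/m closer to 1.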

11
FA's Drawbacks

The number of sorted accesses is still large.
Since all seen tuples should be buffered, the
required buffer size is unbounded.
Does not exploit the bound given by the
aggregation function to determine when to stop
sorted access.

12
TA: Threshold Algorithm (PODS 2001)

Do sorted access in parallel to each of the m sorted lists. As an object R is seen under sorted access in some list, do random access to the other lists to find the grade xi of object R in every other list Li. Then compute the aggregate grade t(x1, ..., xm) for this object R. If this grade is one of the k highest seen so far, remember R and its grade; otherwise discard it.
For each list Li, let xi be the grade of the last object seen under sorted access. Define the threshold value T to be t(x1, ..., xm). As soon as at least k objects have been seen whose grade is at least equal to T, halt.
Return the k objects that have been seen with the highest grades.
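
The description above can be sketched as follows. This is a minimal in-memory sketch under assumptions made here: the function name `threshold_algorithm`, the list layout, and the integer grades are illustrative, a dictionary lookup stands in for random access, and the threshold is checked once per round of parallel sorted access. The data is hypothetical.

```python
def threshold_algorithm(lists, k, agg=sum):
    """Sketch of TA.

    lists: m non-empty lists of (object, grade) pairs, each sorted by
           grade, highest first, and each containing every object once.
    Returns (top-k as [(object, overall grade)], sorted-access depth).
    """
    grade_of = [dict(lst) for lst in lists]   # stands in for random access
    buffer = {}                               # at most k best objects so far
    for depth in range(len(lists[0])):
        last_seen = []
        for lst in lists:
            obj, grade = lst[depth]           # sorted access
            last_seen.append(grade)
            if obj not in buffer:
                # Random access to all lists, then aggregate.
                buffer[obj] = agg([g[obj] for g in grade_of])
                if len(buffer) > k:           # keep only the k best
                    del buffer[min(buffer, key=buffer.get)]
        threshold = agg(last_seen)            # T = t(x1, ..., xm)
        if len(buffer) == k and min(buffer.values()) >= threshold:
            break
    top = sorted(buffer.items(), key=lambda pair: (-pair[1], pair[0]))
    return top, depth + 1


# Hypothetical grades for objects a..f.
list1 = [("b", 9), ("e", 8), ("f", 7), ("a", 6), ("c", 5), ("d", 4)]
list2 = [("a", 9), ("c", 8), ("e", 7), ("b", 3), ("d", 2), ("f", 1)]
list3 = [("a", 9), ("d", 8), ("e", 7), ("b", 3), ("c", 2), ("f", 1)]

ta_top, ta_depth = threshold_algorithm([list1, list2, list3], k=1)
```

With this data and k = 1, the threshold drops to 24 after the second round of sorted access, which equals the grade of a, so the sketch halts at depth 2 while holding only k objects in its buffer.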

13
Example
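[Figure: the three sorted lists, one per attribute (mileage, year, price), with top entries (b, e, f, ...), (a, d, e, ...) and (a, c, e, ...), a buffer for the top-k tuples, and a threshold window across the lists.]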

A buffer keeps the top-k tuples that have been found so far.
For any tuple seen in a sorted list, do random access to get its overall grade. Compare it with the tuples in the buffer, and decide whether to insert it or discard it.
The threshold window (the m records read most recently, one from each list) represents the best grade that any tuple not yet seen could have, assuming we can combine the best values from different tuples.
Notice that this window may not be horizontal if we access different lists at different speeds.
This window helps us decide when to stop: once we find k tuples whose grades are at least equal to the grade of the window tuple, we halt.

14
TA's Properties

TA is optimal for all monotone functions and over every database.
Compared to FA, TA requires only a small, constant-size buffer.
TA allows early stopping.
Can show that TA never stops later than FA. (Why?)
There are times when the user is satisfied with an approximate top-k list. TA can be modified to give such an approximation.
TA can be modified to handle the case where random access is impossible.

15
Instance Optimality

Algorithm b is instance optimal over an algorithm set A and a database instance set D if b is in A and, for every algorithm a in A and every instance d in D, we have cost(b, d) = O(cost(a, d)).
Similar to competitive ratio.
Essentially, b is the best algorithm in A.
Stronger than optimality in the worst case.
TA is instance optimal among all correct algorithms (nondeterministic algorithms).
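
The O(...) in this definition can be spelled out explicitly. The form below follows the definition used in the PODS 2001 paper from the readings; the constant names c and c' are notation chosen here.

```latex
\[
\exists\, c, c' > 0 \;\; \forall a \in \mathbf{A} \;\; \forall d \in \mathbf{D} : \quad
\mathrm{cost}(b, d) \;\le\; c \cdot \mathrm{cost}(a, d) + c'
\]
```

The constant c is called the optimality ratio. The key point is that the same two constants must work for every algorithm a and every instance d, not just in the worst case.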

[Figure: the algorithm set A, containing algorithm b and an arbitrary algorithm a.]
16
Variations of TA

NRA: when no random access is possible
Example: Web search engines, which typically do not allow you to enter a URL and get its ranking
TAZ: when no sorted access is possible for some predicates
Example: find good restaurants near location x (sorted and random access for restaurant ratings, random access only for distances from a mapping site)
CA: when the relative costs of random and sorted accesses matter
TAθ: when only approximate answers are needed
Example: Web search, with lots of good-quality answers

17
Outline

Ranking Queries
Data privacy
XML Data publishing
Database-as-service

18
Motivation

Privacy in publishing XML data
Applications
Web publishing
Data sharing and exchange, e.g., in P2P systems

19
Example Hospital XML data
[Figure: XML tree rooted at hospital, with physician and patient subtrees. Labels visible in the figure include phname, pname, Smith, Walker, Tom, W403 and cancer.]

Goal: hide Alice's disease
Common knowledge: patients in the same ward have the same disease

20
Problem

Given
An XML document to be published
Sensitive data in the document
Common knowledge using which public users can do
data inference
Find
A partial document to be released so that users
cannot infer the sensitive data

21
Research challenges

How to model data inference using common
knowledge?
How to compute all possible inferred data?
How to compute a partial document to be published
without leaking sensitive information?

22
Roadmap

→ Information Leakage
Defining sensitive data
Describing common knowledge
Computing inferred documents
Prevent information leakage

23
Defining sensitive data

Specified using an XQuery, called a regulating query
A special node in the query is marked to indicate the sensitive data

24
Example 1
[Figure: the hospital XML tree with patients (1), (2) and (3), and a regulating query mapped onto it.]
Map the query to the XML tree.
For each mapping, the target of the marked node is sensitive.

25
Example 2
[Figure: the hospital XML tree with patients (1), (2) and (3), and a second regulating query mapped onto it.]
26
Common Knowledge

Represented as XML constraints
Could be obtained in various ways, e.g.,
possible schema
analysis from the published data

27
Common Constraints

Child constraints: //p → //p/c
Example: //patient → //patient/pname
Descendant constraints: //p → //p//d
Example: //patient → //patient//disease
Functional dependencies: //p/a → //p/b
Example: //patient/ward → //patient/disease
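
The three constraint types can be read as checks on a document. The sketch below uses Python's standard `xml.etree.ElementTree` to test whether a small document satisfies each kind of constraint. The helper names and the sample document are assumptions made here for illustration (the element names follow the slides; pairing Tom with W403 and cancer is a guess), and the sketch only handles constraints whose left side is a single tag name.

```python
import xml.etree.ElementTree as ET


def satisfies_child(root, parent, child):
    """//parent -> //parent/child: every parent has such a child."""
    return all(p.find(child) is not None for p in root.iter(parent))


def satisfies_descendant(root, parent, desc):
    """//parent -> //parent//desc: every parent has such a descendant."""
    return all(p.find(".//" + desc) is not None for p in root.iter(parent))


def satisfies_fd(root, parent, a, b):
    """//parent/a -> //parent/b: equal a values imply equal b values."""
    b_for_a = {}
    for p in root.iter(parent):
        x, y = p.findtext(a), p.findtext(b)
        if x is None or y is None:
            continue                      # value not present, nothing to compare
        if b_for_a.setdefault(x, y) != y:
            return False
    return True


# Illustrative document; the values are assumptions, not the slide's data.
doc = ET.fromstring(
    "<hospital>"
    "<patient><pname>Alice</pname><ward>W305</ward><disease>leukemia</disease></patient>"
    "<patient><pname>Tom</pname><ward>W403</ward><disease>cancer</disease></patient>"
    "</hospital>"
)

child_ok = satisfies_child(doc, "patient", "pname")
desc_ok = satisfies_descendant(doc, "patient", "disease")
fd_ok = satisfies_fd(doc, "patient", "ward", "disease")
```

In the slides the constraints are used in the other direction: a user who knows they hold can add missing nodes or values to a partial document.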

[Figure: tree patterns for the three constraint types. For the functional dependency (value equality) between two patients with wards w1, w2 and diseases d1, d2: if w1 = w2, then d1 = d2.]
28
Modify partial document using constraints
Partial document P
C1: //patient → //patient/pname
C2: //patient → //patient//disease
C3: //patient/ward → //patient/disease
29
Apply C1 on document P
C1(P)
C1: //patient → //patient/pname
30
Apply C2 on document P
C2(P)
C2: //patient → //patient//disease

Floating branch: exact location unknown

31
Apply C3 on document P
C3(P)
C3: //patient/ward → //patient/disease
32
Apply a sequence of constraints <C2, C3>
C2: //patient → //patient//disease
C3: //patient/ward → //patient/disease
33
Another user applies a different sequence of constraints <C3, C2>
C2: //patient → //patient//disease
C3: //patient/ward → //patient/disease

After applying C3, we cannot use C2 to expand the tree.
No more floating branch!

34
They look different!

P1 is m-contained in P2
There is a mapping from P1 to P2.
A floating branch can be mapped to a path.
The m-containing document P2 has more information
P2 is also m-contained in P1.
Thus they are m-equivalent!

35
What documents can users infer?

Different users can use different sequences of
constraints to do inference
Thus they can infer different documents
Questions
Can an inference process terminate?
What inferred document should we consider to
prevent leakage of sensitive data?

36
Theorem

Given a partial document P of an XML document D and a set of constraints C = {C1, ..., Ck}, there is a document M that can be inferred from P using a sequence of constraints, such that for any sequence of constraints, its resulting document is m-contained in M.
M can be computed using a greedy approach.
Such a document is unique under m-equivalence.

37
Information leakage

For a partial document P, if there exists a
regulating query A, such that the maximal
inferred document M can produce a non-empty
answer to the query A, then we say P causes
information leakage.
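
The leakage test can be illustrated on a flattened version of the running example. This is a simplified sketch under assumptions made here: patients are plain dictionaries rather than XML nodes, only the functional dependency ward → disease is applied, the regulating query is hard-coded as "the disease of the patient named Alice", and the patient named Betty is hypothetical.

```python
def infer_with_fd(patients, key="ward", dep="disease"):
    """Apply the dependency key -> dep until nothing more can be inferred."""
    docs = [dict(p) for p in patients]        # do not modify the input
    changed = True
    while changed:
        changed = False
        known = {p[key]: p[dep] for p in docs if key in p and dep in p}
        for p in docs:
            if key in p and dep not in p and p[key] in known:
                p[dep] = known[p[key]]        # inferred value
                changed = True
    return docs


def regulating_query(patients):
    """Answer of the (hard-coded) regulating query: Alice's disease."""
    return [p["disease"] for p in patients
            if p.get("pname") == "Alice" and "disease" in p]


def leaks(partial_document):
    inferred = infer_with_fd(partial_document)  # plays the role of M
    return len(regulating_query(inferred)) > 0


# Partial document: Alice's disease was removed, her ward was kept.
partial = [
    {"pname": "Alice", "ward": "W305"},
    {"pname": "Betty", "ward": "W305", "disease": "leukemia"},  # hypothetical
    {"pname": "Tom", "ward": "W403", "disease": "cancer"},
]
# A more aggressive cut: Alice's ward is removed as well.
safer = [{"pname": "Alice"}] + partial[1:]

partial_leaks = leaks(partial)
safer_leaks = leaks(safer)
```

Here removing only the sensitive value is not enough, because the shared ward lets a user infer it again; removing Alice's ward as well breaks the inference.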

[Figure: partial document P and regulating query A.]
38
Roadmap

Information Leakage
→ Prevent information leakage

39
Formal Problem

Given an XML document D, a regulating query A, and common knowledge represented as constraints C1, ..., Ck:
How to find a partial document P without information leakage?
Such a document is called a valid partial document.
The empty document is a trivial one.
We want the published document to have as much data as possible.

40
An algorithm

We develop an algorithm for solving this problem
We use the running example to illustrate the
algorithm

41
Example
[Figure: the hospital XML tree with patients (1), (2) and (3), and the regulating query A (nodes patient, disease, Alice; marker S).]
Functional dependency: //patient/ward → //patient/disease
42
Remove sensitive data A(D)
[Figure: the hospital XML tree with patients (1), (2) and (3), and the regulating query A (nodes patient, disease, Alice; marker S).]
Remaining document: D - A(D)
43
Compute the maximal inferred document M of D-A(D)
[Figure: the hospital XML tree with patients (1), (2) and (3), and the regulating query A (nodes patient, disease, Alice; marker S).]
Maximal inferred document M
44
Testing Information Leakage
[Figure: the hospital XML tree with patients (1), (2) and (3), and the regulating query A (nodes patient, disease, Alice; marker S).]
There is a mapping from A to P, so information is leaked.
45
Computing a valid partial document
D - A(D)
A(D)
How to break the mappings? How to chase back the
inference steps?
46
AND/OR Graphs

A structure representing how a goal can be
reached by solving subproblems.
We use such graphs to formulate the process of
finding a valid partial document

47
[Figure: the hospital XML tree with patients (1), (2) and (3), the regulating query A (nodes patient, disease, Alice; marker S), and the START node of the AND/OR graph.]

Consider mapping images of the leaf nodes in A
An OR connector shows that solving any of the
subproblems can solve the parent problem.

[Figure: START leads to an OR connector whose successors are the nodes Alice (1) and leukemia (1).]
48
[Figure: the hospital XML tree, the regulating query A, and the expanded AND/OR graph. START leads to an OR connector over Alice (1) and leukemia (1); leukemia (1) leads to an AND connector over two OR connectors, whose successors are W305 and leukemia nodes of patients (1), (2) and (3).]

Multiple ways to infer the sensitive data.
An AND connector shows that solving ALL the subproblems can solve the parent problem.
49
[Figure: the hospital XML tree with patients (1), (2) and (3), and the regulating query A (nodes patient, disease, Alice; marker S).]

Continue expanding the AND/OR graph
50
AND/OR Graphs (cont)

A special START node representing the goal of
computing a valid partial document.
The graph has nodes corresponding to nodes in the
maximal inferred document M.
Such a node represents the subproblem of hiding
its corresponding node n in M
This node n should be removed from M
It cannot be inferred using the constraints and
other nodes in M.

51
Solution graphs

A connected subgraph (of M) including the START
node
For each node in the subgraph, its successor
connectors are also in the subgraph.
If it contains an OR connector, it must also
contain one of the connector's successors.
If it contains an AND connector, it must also
contain all the successors of the connector.

52
Example solution graphs
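[Figure: two example solution graphs. One consists of START, an OR connector and the node Alice (1). The other consists of START, an OR connector, the node leukemia (1), an AND connector, two OR connectors and the node W305 (1).]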
53
Computing a valid partial document using a
solution graph

For a solution graph G, for each node in G, we
remove the corresponding node in M to get a valid
partial document
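
The search for solution graphs can be sketched on a small AND/OR graph. Everything below is an illustrative reading of the running example, not the authors' algorithm: the graph encoding, the node labels such as `ward1:W305`, and the assumption that the graph is acyclic are choices made here. Each result is the set of nodes a solution graph would remove from M.

```python
from itertools import product


def solve(item, graph):
    """Return every set of document nodes that solves `item`.

    item:  a node name (str) or a connector ("OR" | "AND", [children]).
    graph: maps a node name to the list of its successor connectors.
    Assumes the graph is acyclic.
    """
    if isinstance(item, str):
        # Hiding a node means removing it and solving all its connectors.
        results = [frozenset([item])]
        for connector in graph.get(item, []):
            results = [a | b for a in results for b in solve(connector, graph)]
        return results
    kind, children = item
    per_child = [solve(child, graph) for child in children]
    if kind == "OR":                      # any one successor suffices
        return [s for sols in per_child for s in sols]
    # AND: combine one solution from every successor
    return [frozenset().union(*combo) for combo in product(*per_child)]


# Hypothetical encoding: hide Alice's name, or hide her disease and also
# block both ways of inferring it from patients (2) and (3) in ward W305.
graph = {
    "START": [("OR", ["pname1:Alice", "disease1:leukemia"])],
    "disease1:leukemia": [("AND", [
        ("OR", ["ward1:W305", "ward2:W305", "disease2:leukemia"]),
        ("OR", ["ward1:W305", "ward3:W305", "disease3:leukemia"]),
    ])],
}

# START is bookkeeping only; it is not a node of the document.
solutions = {s - {"START"} for s in solve("START", graph)}
smallest = min(solutions, key=len)
```

Enumerating all solutions is exponential in general; it is used here only to make the definition concrete. Picking a solution with few nodes keeps more of the document publishable.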

[Figure: the two example solution graphs next to the hospital XML tree with patients (1), (2) and (3), showing which nodes each solution graph removes.]
54
Constructing an AND/OR Graph

We give an algorithm for computing an AND/OR graph.
It considers the inference steps of the different constraints.
Many algorithms have been proposed for finding a solution graph; they are applicable here.
No need to construct the entire AND/OR graph.
Search for a solution graph on the fly.

55
Related work
Different scenarios of database security based on
trust domains
Scenario                          Trust domains
A. Single-user DBMS               {Data, Execution, Query}
B. C/S access control             {Data, Execution} | {Query}
C. Database as a service          {Data, Query} | {Execution}
D. Data publishing (our work)     {Query, Execution} | {Data}
56
Summary of 2nd paper

Formulated the problem of publishing an XML document without information leakage due to data inference.
Showed the effect of constraints on inference.
Gave an algorithm for finding a valid partial document of a given document.
