A Privacy Preserving Index for Range Queries

1
A Privacy Preserving Index for Range Queries

Bijit Hore, Sharad Mehrotra, Gene Tsudik

2
Database as a Service (DAS) [Hacigumus et al., SIGMOD 2002]

A client wants to store data on a remote server and run queries on it
BUT he does not trust the server
Solution: encrypt the data and store it
How do you query the encrypted data?

[Figure: DAS architecture. Trusted client side: User, Original Query, Query Translator, Query Post Processor, True Results. Untrusted service provider side: Server, Query over Encrypted Data, Encrypted Indexed Client Data, Encrypted Results.]
3
Data storage in DAS
Client side storage
Metadata: buckets for sal

Tag  Range (K)
Z0   0 to 200
Z1   200 to 450
Z2   450 to 600
Z3   600 to 650
Z4   650 to 700

Server side data

Server side table (encrypted and indexed) R^A

etuple   shares^A  age^A  sal^A
X_at_FJ  X1        Y2     Z1
CH(G!    X2        Y1     Z1
DL       X3        Y2     Z2
GH)      X3        Y3     Z3

The shares^A, age^A and sal^A columns hold bucket tags.

Original table (plain text) R

eid  name   addr   shares  age  sal
345  Tom    Maple  5400    32   390K
876  Mary   Main   5800    22   423K
234  John   River  6000    34   598K
780  Jerry  Ocean  6200    48   632K
4
Querying in DAS
Client-side query:
Select * from R where R.sal ∈ [400K, 600K]
Server-side query:
Select etuple from R^A where R^A.sal^A = Z1 ∨ Z2

Server side table (encrypted and indexed) R^A

etuple   shares^A  age^A  sal^A
X_at_FJ  X1        Y2     Z1
CH(G!    X2        Y1     Z1
DL       X3        Y2     Z2
GH)      X3        Y3     Z3

The shares^A, age^A and sal^A columns hold bucket tags.

Client side table (plain text) R

eid  name   addr   shares  age  sal
345  Tom    Maple  5400    32   390K
876  Mary   Main   5800    22   426K
234  John   River  6000    34   598K
780  Jerry  Ocean  6200    48   634K
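
The querying flow on this slide can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: `toy_encrypt` and `toy_decrypt` are hypothetical stand-ins for real encryption, the function names are invented for this sketch, and each bucket Zi is assumed to cover (lower, upper] so that the range [400K, 600K] maps to Z1 and Z2 as on the slide.

```python
import base64
import json
from bisect import bisect_left

# Client-side metadata: bucket boundaries for sal (in K) and their tags.
# Assumption: bucket Zi covers (lower, upper]; Z0 also includes its lower bound.
BOUNDS = [0, 200, 450, 600, 650, 700]
TAGS = ["Z0", "Z1", "Z2", "Z3", "Z4"]

def toy_encrypt(row):
    """Hypothetical stand-in for encryption (NOT secure): an opaque encoding."""
    return base64.b64encode(json.dumps(row).encode()).decode()

def toy_decrypt(etuple):
    return json.loads(base64.b64decode(etuple))

def bucket_tag(sal):
    """Tag of the bucket that holds a plaintext salary."""
    return TAGS[max(0, bisect_left(BOUNDS, sal) - 1)]

def translate(lo, hi):
    """Query translator: tags of all buckets overlapping the closed range [lo, hi]."""
    tags = []
    for tag, b_lo, b_hi in zip(TAGS, BOUNDS, BOUNDS[1:]):
        starts_before_end = b_lo < hi or (tag == TAGS[0] and b_lo <= hi)
        if starts_before_end and lo <= b_hi:
            tags.append(tag)
    return tags

def server_select(server_table, tags):
    """Server side: match on bucket tags only; plaintext is never visible here."""
    return [etuple for etuple, sal_tag in server_table if sal_tag in tags]

def post_process(etuples, lo, hi):
    """Query post processor: decrypt and drop false positives."""
    rows = [toy_decrypt(e) for e in etuples]
    return [r for r in rows if lo <= r["sal"] <= hi]

# Plaintext table R from this slide (sal in K; other columns omitted for brevity).
R = [
    {"eid": 345, "name": "Tom", "sal": 390},
    {"eid": 876, "name": "Mary", "sal": 426},
    {"eid": 234, "name": "John", "sal": 598},
    {"eid": 780, "name": "Jerry", "sal": 634},
]
SERVER_TABLE = [(toy_encrypt(r), bucket_tag(r["sal"])) for r in R]

tags = translate(400, 600)                      # buckets overlapping [400K, 600K]
candidates = server_select(SERVER_TABLE, tags)  # superset of the true answer
answer = post_process(candidates, 400, 600)
print(tags, len(candidates), [r["name"] for r in answer])
```

Under these assumptions the server returns three etuples for the two matching tags, and the client discards Tom (390K) as a false positive after decryption.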
5
Issues in partitioning

How many buckets should one use?
How to partition the data?

6
Data Privacy in DAS

Adversary (A)
Access to server-side data
Malicious intentions
Privacy issue in partitioned data
Small range of a bucket B + 1 sample value from B ⇒ almost total disclosure of all elements in B
Privacy goal of client
To hide all useful information from A
Put all values of an attribute in a single bucket!
7
Research challenges & our contributions

Precision: how to partition data
Definition
Optimal partitioning to maximize precision
Privacy: quantifying disclosure
Adversary's goals
Measures of information disclosure
Privacy-precision trade-off
Controlled diffusion algorithm
Experiments & conclusion

[Figure labels: Privacy, Precision]
8
Precision of range queries

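Given a partition of data into M parts
Precision(q) = 1 − (# false positives / # tuples returned for q)
Recall = 1
Workload: all O(N^2) range queries are equiprobable (uniform)

# false positives ∝ $\sum_B N_B F_B = 5 \cdot 32 + 5 \cdot 18 = 250$, where $N_B$ is the number of distinct values in bucket B and $F_B$ is the total frequency of those values
For the query q in the figure: Precision = 1 − 20/50 = 0.6

[Figure: frequency histogram over Salary (100Ks), values 1 to 10 (N = 10, domain size), partitioned into M = 2 buckets; the marked bucket B has N_B = 5 and F_B = 18; a range query q is marked.]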
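
The two quantities on this slide can be reproduced with a short Python sketch. The histogram below is an assumption: the bar heights are those visible in the figure, but their left-to-right order is inferred so that the two buckets have total frequencies 32 and 18. The query range [4, 7] is likewise a hypothetical choice that yields 20 false positives out of 50 returned tuples.

```python
# Frequencies of salary values 1..10 (in 100Ks). The order of the bars is an
# assumption, chosen to agree with the bucket totals quoted on the slide.
FREQ = [4, 4, 4, 10, 10, 6, 4, 4, 2, 2]

def bucket_cost(freq, lo, hi):
    """N_B * F_B for the bucket covering values lo..hi (1-based, inclusive)."""
    n_b = hi - lo + 1
    f_b = sum(freq[lo - 1:hi])
    return n_b * f_b

def precision(freq, buckets, q_lo, q_hi):
    """1 - (false positives / tuples returned) for the range query [q_lo, q_hi]."""
    returned = sum(sum(freq[lo - 1:hi]) for lo, hi in buckets
                   if lo <= q_hi and q_lo <= hi)   # every overlapping bucket is returned whole
    matching = sum(freq[q_lo - 1:q_hi])            # tuples that truly satisfy the query
    return 1 - (returned - matching) / returned

two_buckets = [(1, 5), (6, 10)]
total_cost = sum(bucket_cost(FREQ, lo, hi) for lo, hi in two_buckets)
print(total_cost)                          # 5*32 + 5*18 = 250
print(precision(FREQ, two_buckets, 4, 7))  # hypothetical query q
```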
9
Query optimal buckets (QOB)

Optimization problem
For the uniform workload, find a partition of the data into M buckets that minimizes total false positives, i.e.

Minimize $\sum_{B=1}^{M} N_B F_B$ (M = 4 in the example)

Example: QOB(1, 10, 4) is built from QOB(1, 7, 3), the optimal solution to a sub-problem, plus Cost(8, 10), the cost of the rightmost bucket, for which $N_B F_B = 24$

[Figure: frequency histogram over Salary (100Ks), values 1 to 10 (N = 10, domain size), with the rightmost bucket covering values 8 to 10.]
10
QOB (cont.)
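Optimal cost: $\sum_{B=1}^{4} N_B F_B = 12 \cdot 3 + 20 \cdot 2 + 10 \cdot 2 + 8 \cdot 3 = 120$

[Figure: frequency histogram over Salary (100Ks), values 1 to 10, partitioned into the four optimal buckets B1, B2, B3, B4.]

Time complexity: O(n^2 M); space: O(nM), where n = # distinct values in the dataset and M = # buckets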
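
A dynamic program of the kind described on the QOB slides can be sketched in Python as below. This is one plausible reading of the recurrence (optimal cost of a prefix with one fewer bucket, plus the cost of the rightmost bucket), not the authors' code; the function name `qob` mirrors the slide notation, and the histogram order is an assumption inferred from the bucket totals on the slides. It runs in O(n^2 M) time and O(nM) space.

```python
def qob(freq, m):
    """Minimum sum of N_B * F_B over partitions of freq into m contiguous buckets.

    Returns (cost, buckets) with buckets as (lo, hi) pairs, 1-based and inclusive.
    """
    n = len(freq)
    prefix = [0]
    for f in freq:
        prefix.append(prefix[-1] + f)

    def cost(i, j):
        # N_B * F_B for a single bucket covering values i..j
        return (j - i + 1) * (prefix[j] - prefix[i - 1])

    inf = float("inf")
    # best[k][j]: optimal cost of splitting values 1..j into k buckets
    best = [[inf] * (n + 1) for _ in range(m + 1)]
    cut = [[0] * (n + 1) for _ in range(m + 1)]
    best[0][0] = 0
    for k in range(1, m + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):          # rightmost bucket covers i+1..j
                c = best[k - 1][i] + cost(i + 1, j)
                if c < best[k][j]:
                    best[k][j], cut[k][j] = c, i

    # Walk the recorded cut points back from the right end.
    buckets, j = [], n
    for k in range(m, 0, -1):
        i = cut[k][j]
        buckets.append((i + 1, j))
        j = i
    return best[m][n], buckets[::-1]

# Assumed bar order for salary values 1..10 (in 100Ks).
FREQ = [4, 4, 4, 10, 10, 6, 4, 4, 2, 2]
optimal_cost, optimal_buckets = qob(FREQ, 4)
print(optimal_cost, optimal_buckets)
```

On the assumed histogram the sketch selects the buckets 1 to 3, 4 to 5, 6 to 7 and 8 to 10, whose per-bucket terms (12·3, 20·2, 10·2, 8·3) are the ones listed on the slide.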
11
Outline

Optimal data partitioning for range queries
Adversarial goals & privacy measures
Balancing privacy and precision
Experiments & conclusion

12
Adversary's learning model

Need to learn bucket properties to estimate sensitive values
Model
A's domain knowledge
Sample values from buckets
Worst case assumption for privacy analysis
A knows the exact value distribution for every bucket

A learns distribution of values in buckets
13
Adversarial Goal (I)

Individual Centric Information
E.g., what is the salary of an individual I?
Value Estimation Power (VEP) of A
Variance of bucket-distribution is an inverse
measure of VEP

[Figure: two bucket distributions plotted over the bucket range. Large variance (preferred): large average error of value estimation for the adversary. Small variance: small average error.]
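
To make the variance measure concrete, the following Python sketch computes the variance of a bucket's value distribution from (value, frequency) pairs. The function name and the two example buckets are illustrative assumptions, not taken from the slides; the wide bucket reuses an assumed histogram over salary values 1 to 10.

```python
def bucket_variance(values, freqs):
    """Variance of the value distribution inside one bucket."""
    total = sum(freqs)
    mean = sum(v * f for v, f in zip(values, freqs)) / total
    return sum(f * (v - mean) ** 2 for v, f in zip(values, freqs)) / total

# Narrow bucket: two adjacent values, equally frequent (hypothetical example).
narrow = bucket_variance([4, 5], [10, 10])

# Wide bucket: the whole domain 1..10 with an assumed histogram.
wide = bucket_variance(list(range(1, 11)), [4, 4, 4, 10, 10, 6, 4, 4, 2, 2])

print(narrow, wide)  # the wide bucket has the larger variance
```

A larger variance means the adversary's estimate of an individual value from the bucket is, on average, further from the truth.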
14
Adversarial Goal (II)

Query Centric Information
E.g., which individuals have salary ∈ [100k, 150k]?
Set Estimation Power (SEP) of A
Entropy of bucket-distribution is an inverse
measure of SEP

[Figure: two bucket distributions over the bucket range, each with the query range 100k to 150k marked. Best case, high entropy & large variance: large average error of query-set estimation for the adversary. Low entropy & large variance: small average error.]

$H(X) = -\sum_i p_i \log p_i$
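
The entropy measure can be illustrated with a few lines of Python. The slide does not state the base of the logarithm; base 2 (bits) is assumed here, and the example frequency lists are hypothetical.

```python
import math

def bucket_entropy(freqs):
    """Shannon entropy (in bits, an assumed base) of a bucket's value distribution."""
    total = sum(freqs)
    return -sum((f / total) * math.log2(f / total) for f in freqs if f > 0)

balanced = bucket_entropy([10, 10])  # two values, equally likely
skewed = bucket_entropy([19, 1])     # one value dominates
single = bucket_entropy([20])        # only one value: nothing left to guess

print(balanced, skewed, single)
```

The balanced bucket has the highest entropy of the three, so an adversary who knows the distribution is least able to tell which elements fall inside a given query range.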
15
Outline

Optimal data partitioning for range queries
Adversarial goals & privacy measures
Balancing privacy and precision
Experiments & conclusion

16
Privacy-Precision Trade-off

Optimal buckets might offer less privacy than desired
Small variance ⇒ partial disclosure of numeric value
Small entropy ⇒ total disclosure with high probability (e.g. categorical data)
Partial detection of query-sets (for all cases)

Objective: an algorithm that allows trading off a bounded amount of query precision for greater variance and entropy
17
The controlled diffusion algorithm

A simple observation

Let a query Q overlap only with B0
If elements of B0 are distributed into CB1, CB2 & CB3 randomly
Now Q overlaps with CB1, CB2 & CB3
With new buckets, the precision for Q drops by a factor of
$(|CB_1| + |CB_2| + |CB_3|) / |B_0|$
Any re-distribution scheme where, $\forall B_i$, this ratio $\le K$ ⇒ precision degradation is bounded above by K

[Figure: bucket B0 diffused into composite buckets CB1, CB2 and CB3.]
18
Controlled diffusion Algorithm

Compute optimal buckets on data set D → {B1, …, BM}
Fix max degradation factor K
Initialize M empty composite buckets → {CB1, …, CBM}
Set target size of each CB to $f_{CB} = |D|/M$ (equidepth)
$\forall B_i$: select $d_i$ CBs at random, where $d_i = K|B_i| / f_{CB}$
Diffuse elements of $B_i$ into these uniformly at random
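
The steps above can be sketched in Python as follows. Several details are assumptions of this sketch rather than part of the slide: d_i is rounded down and clamped to the range 1 to M, the target size f_CB is used only to compute d_i and is not enforced as a hard cap, and the function name and the optional `rng` parameter are invented for illustration.

```python
import random

def controlled_diffusion(optimal_buckets, k, rng=None):
    """Diffuse M optimal buckets (lists of elements) into M composite buckets."""
    rng = rng or random.Random()
    m = len(optimal_buckets)
    total = sum(len(b) for b in optimal_buckets)
    f_cb = total / m                      # target (equidepth) size of each CB
    composite = [[] for _ in range(m)]
    for bucket in optimal_buckets:
        # d_i = K * |B_i| / f_CB; rounding down and clamping are assumptions
        d = min(m, max(1, int(k * len(bucket) / f_cb)))
        targets = rng.sample(range(m), d)  # d distinct CBs chosen at random
        for element in bucket:
            composite[rng.choice(targets)].append(element)
    return composite

# Hypothetical input: four optimal buckets of sizes 12, 20, 10 and 8,
# with each element tagged by the index of the bucket it came from.
sizes = [12, 20, 10, 8]
buckets = [[(i, j) for j in range(n)] for i, n in enumerate(sizes)]
cbs = controlled_diffusion(buckets, k=2, rng=random.Random(7))
print([len(cb) for cb in cbs])
```

Because each element of B_i lands in one of at most d_i composite buckets, a query that touched only B_i now touches at most d_i composite buckets, which is what ties the precision loss to K.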

19
Controlled Diffusion (Example)
Degradation factor K = 2
Metadata size increases from O(M) to O(KM)

[Figure: top panel, "Query optimal buckets": frequency histogram (Freq vs. Values 1 to 10) partitioned into B1, B2, B3, B4. Bottom panel, "Composite Buckets" (final set of buckets on server): CB1, CB2, CB3, CB4 over the same value axis.]
20
Some features of the diffusion algorithm

Many consecutive optimal buckets might get diffused into a common set of CBs ⇒ observed precision degradation < K
Elements with the same value can go to multiple buckets ⇒
Giving it an extra degree of freedom compared to hashing
Not best for point queries
Random choice in the algorithm ⇒ each bucket distribution approaches the data distribution as K increases ⇒ reduces the information the adversary gains by learning buckets

21
Outline

Optimal data partitioning for range queries
Adversarial goals & privacy measures
Balancing privacy and precision
Experiments & conclusion

22
Experiments

Data sets
Synthetic data: 10^5 integers in [0, 999], chosen uniformly at random
Real data: 10^4 real values in [-0.8, 8.0], Corel Image dataset (UCI KDD archive)
Query workloads (2 of size 10^4 each)
End points chosen uniformly at random from the respective ranges
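
A setup like the synthetic one described here could be generated along these lines. This is only a sketch: the seeds are arbitrary and serve repeatability, the function names are hypothetical, and ordering the two random end points so that the lower one comes first is an assumption.

```python
import random

def make_synthetic(n=10**5, lo=0, hi=999, seed=0):
    """n integers drawn uniformly at random from [lo, hi]."""
    rng = random.Random(seed)  # arbitrary seed, for repeatability only
    return [rng.randint(lo, hi) for _ in range(n)]

def make_workload(size=10**4, lo=0, hi=999, seed=1):
    """Range queries whose end points are drawn uniformly at random from [lo, hi]."""
    rng = random.Random(seed)
    queries = []
    for _ in range(size):
        a, b = rng.randint(lo, hi), rng.randint(lo, hi)
        queries.append((min(a, b), max(a, b)))  # assumed: lower end point first
    return queries

data = make_synthetic()
workload = make_workload()
print(len(data), len(workload), workload[0])
```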

Relative decrease in precision of composite
buckets
Relative increase in standard deviation in
composite buckets
Relative increase in entropy in composite buckets

24
Composite buckets (sample)
K = 6, M = 350
K = 10, M = 250
25

Visualizing trade-offs for various bucketization parameters
E.g., the marked points show the average entropy and precision we get for 100 buckets and a degradation factor of 2
The same point is marked in the precision vs. standard deviation trade-off space
⇒ Provides an easy way to visualize the design space and choose parameters of interest

26
Summary

An optimal algorithm for partitioning data for
range queries
Statistical measures of data privacy
Variance
Entropy
Fast, simple algorithm for re-bucketizing data
Bounded amount of precision degradation
Substantial increase in privacy level

27
Related work

Hacigumus et al., SIGMOD 2002: Executing SQL over Encrypted Data in the Database-Service-Provider Model.
Damiani et al., ACM CCS 2003: Balancing Confidentiality and Efficiency in Untrusted Relational DBMSs.
Bouganim et al., VLDB 2002: Chip-Secured Data Access: Confidential Data on Untrusted Servers.

28
THANK YOU!
Questions?
29
Privacy in DAS

Here the goal of data privacy is not just ensuring non-disclosure of identity. It is more general!

DAS
Privacy criteria: hide as much information as possible (even at the aggregate level)
Utility criteria: maintain only the necessary information required for server-side query evaluation (at the desired degree of accuracy)

Privacy-preserving DM & Statistical DB
Privacy criteria: protect against disclosure of identity
Utility criteria: minimize information loss, i.e. maximize utility for data miners and retain as much aggregate-level information as possible

30
Individual Privacy Measure

Average Squared Error of Estimation (ASEE)
Error in approximating the true value of a r.v. $X_B$ by another r.v. $X'_B$ (learned by A)
$ASEE(X_B, X'_B) = Var(X_B) + Var(X'_B) + (E(X_B) - E(X'_B))^2$
Variance of the bucket distribution, $Var(X_B)$, is our measure of individual privacy (lower bound)
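
The decomposition ASEE = Var(X_B) + Var(X'_B) + (E(X_B) − E(X'_B))^2 can be checked numerically. The Python sketch below assumes that X_B and X'_B are independent, which is what makes the expected squared difference split into these three terms; the two distributions and the function names are hypothetical examples.

```python
import math
from itertools import product

def moments(dist):
    """Mean and variance of a discrete distribution given as {value: probability}."""
    mean = sum(v * p for v, p in dist.items())
    var = sum(p * (v - mean) ** 2 for v, p in dist.items())
    return mean, var

def asee(true_dist, learned_dist):
    """E[(X_B - X'_B)^2] by direct enumeration, assuming independence."""
    return sum(p * q * (x - y) ** 2
               for (x, p), (y, q) in product(true_dist.items(), learned_dist.items()))

true_dist = {4: 0.5, 5: 0.5}        # hypothetical true bucket distribution
learned_dist = {4: 0.25, 5: 0.75}   # hypothetical distribution learned by A

mean_t, var_t = moments(true_dist)
mean_l, var_l = moments(learned_dist)
direct = asee(true_dist, learned_dist)
decomposed = var_t + var_l + (mean_t - mean_l) ** 2
print(direct, decomposed)
```

Since the last two terms are never negative, the direct value is always at least Var(X_B), which is the sense in which the bucket variance acts as a lower bound.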

31
Set oriented Privacy Measure

Entropy of bucket distribution is our measure for query-centric privacy
Measures uncertainty associated with a r.v. (e.g. true class of an element for categorical data)
An inverse measure of the quality of partial solution sets that A can derive for a query

$H(X) = -\sum_i p_i \log p_i$
32
Meta data size increase in diffusion