A Privacy Preserving Index for Range Queries


Learn more at: https://ics.uci.edu
1
A Privacy Preserving Index for Range Queries
  • Bijit Hore, Sharad Mehrotra, Gene Tsudik

2
Database as a Service (DAS) [Hacigumus et al., SIGMOD 2002]
  • A client wants to store data on a remote server
    and run queries on it
  • BUT he does not trust the server
  • Solution: encrypt the data and store it
  • How do you query the encrypted data?

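[Figure: DAS architecture. Trusted client: User, Query Translator, Query Post Processor. Untrusted service provider: Server holding the Encrypted & Indexed Client Data. The user's Original Query is translated into a Query over Encrypted Data; the server's Encrypted Results are post-processed into True Results.]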
3
Data storage in DAS
Client side storage: meta data (bucket boundaries)

  bucket   Z0      Z1        Z2        Z3        Z4
  range    0-200   200-450   450-600   600-650   650-700

Server side data: Table (encrypted + indexed) RA

  etuple    sharesA   ageA   salA
  X_at_FJ   X1        Y2     Z1
  CH(G!     X2        Y1     Z1
  DL        X3        Y2     Z2
  GH)       X3        Y3     Z3

  (sharesA, ageA and salA hold the bucket-tags)

Original Table (plain text) R

  eid   name    addr    shares   age   sal
  345   Tom     Maple   5400     32    390K
  876   Mary    Main    5800     22    423K
  234   John    River   6000     34    598K
  780   Jerry   Ocean   6200     48    632K
4
Querying in DAS
Client-side query:
  Select * from R where R.sal ∈ [400K, 600K]
Server-side query:
  Select etuple from RA where RA.salA = z1 ∨ z2

Server side Table (encrypted + indexed) RA

  etuple    sharesA   ageA   salA
  X_at_FJ   X1        Y2     Z1
  CH(G!     X2        Y1     Z1
  DL        X3        Y2     Z2
  GH)       X3        Y3     Z3

  (sharesA, ageA and salA hold the bucket-tags)

Client side Table (plain text) R

  eid   name    addr    shares   age   sal
  345   Tom     Maple   5400     32    390K
  876   Mary    Main    5800     22    426K
  234   John    River   6000     34    598K
  780   Jerry   Ocean   6200     48    634K
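The query translation on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each bucket is treated as the interval (lo, hi], which is what makes the range [400K, 600K] map to Z1 and Z2 as on the slide; the etuple ids e1..e4 and the `CLIENT_DECRYPT` lookup are hypothetical stand-ins for real ciphertexts and client-side decryption.

```python
# Bucket boundaries for sal (in K) and their tags, as listed in the slide's meta data.
BOUNDS = [0, 200, 450, 600, 650, 700]
TAGS = ["Z0", "Z1", "Z2", "Z3", "Z4"]

def translate(lo, hi):
    """Client side: map the plaintext range [lo, hi] to the tags of all
    buckets it overlaps. Buckets are assumed to be (b_lo, b_hi]."""
    return [t for t, b_lo, b_hi in zip(TAGS, BOUNDS, BOUNDS[1:])
            if b_lo < hi and b_hi >= lo]

# Server side: opaque etuples plus the salA bucket tag (hypothetical ids).
SERVER_RA = [("e1", "Z1"), ("e2", "Z1"), ("e3", "Z2"), ("e4", "Z3")]

# Stand-in for decryption on the trusted client (name, sal in K).
CLIENT_DECRYPT = {"e1": ("Tom", 390), "e2": ("Mary", 426),
                  "e3": ("John", 598), "e4": ("Jerry", 634)}

def server_select(tags):
    """Server side: return every etuple whose tag is in the translated query."""
    wanted = set(tags)
    return [e for e, tag in SERVER_RA if tag in wanted]

def post_process(etuples, lo, hi):
    """Client side: decrypt and drop false positives."""
    rows = [CLIENT_DECRYPT[e] for e in etuples]
    return [name for name, sal in rows if lo <= sal <= hi]

tags = translate(400, 600)
candidates = server_select(tags)
answer = post_process(candidates, 400, 600)
```

Here the server returns three etuples, and the post processor discards Tom (390K) as a false positive, leaving Mary and John.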
5
Issues in partitioning
  • How many buckets should one use?
  • How to partition the data?

6
Data Privacy in DAS
  • Adversary (A): has access to server-side data and
    malicious intentions
  • Privacy issue in partitioned data: a small range
    of a bucket B + 1 sample value from B ⇒ almost
    total disclosure of all elements in B
  • Privacy goal of client: to hide all useful
    information from A
  • Put all values of an attribute in a single
    bucket!

7
Research challenges & our contributions
  • Precision: how to partition data
  • Definition
  • Optimal partitioning to maximize precision
  • Privacy: quantifying disclosure
  • Adversary's goals
  • Measures of information disclosure
  • Privacy-Precision trade-off
  • Controlled diffusion algorithm
  • Experiments & Conclusion

8
Precision of range queries
  • Given a partition of data into M parts
  • $\text{Precision}(q) = 1 - (\#\text{false positives} / \#\text{tuples returned for } q)$
  • Recall $= 1$
  • Workload: all $O(N^2)$ range queries are
    equiprobable (uniform)

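# false positives $\propto \sum_B N_B F_B = 5 \cdot 32 + 5 \cdot 18 = 250$
($N_B$: number of domain values in bucket B; $F_B$: number of tuples in bucket B)
Precision of the query q shown $= 1 - 20/50 = 0.6$
[Figure: histogram of Frequency vs. Salary (100Ks), domain size N = 10, partitioned into M = 2 buckets; the bucket marked B has $N_B = 5$, $F_B = 18$.]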
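The two numbers on this slide can be reproduced with a small Python sketch. The per-value frequencies in `FREQ` are an assumption: the transcript only preserves the multiset of bar heights, so the ordering below is one arrangement that is consistent with the slide's bucket totals (32 and 18). The query range [4, 7] is likewise a hypothetical choice that yields 30 true answers out of 50 returned tuples.

```python
# Assumed frequencies for salary values 1..10 (in 100Ks); the ordering is a guess
# that matches the bucket totals on the slide, not something the slide states.
FREQ = [4, 4, 4, 10, 10, 6, 4, 4, 2, 2]

# The M = 2 partition from the slide, as 1-based inclusive value ranges.
BUCKETS = [(1, 5), (6, 10)]

def tuples_in(lo, hi):
    """Number of tuples whose value lies in [lo, hi]."""
    return sum(FREQ[v - 1] for v in range(lo, hi + 1))

def workload_cost(buckets):
    """Sum of N_B * F_B, proportional to false positives under the uniform workload."""
    return sum((hi - lo + 1) * tuples_in(lo, hi) for lo, hi in buckets)

def precision(q_lo, q_hi, buckets):
    """1 - (# false positives / # tuples returned) for the range query [q_lo, q_hi]."""
    returned = sum(tuples_in(lo, hi) for lo, hi in buckets
                   if lo <= q_hi and hi >= q_lo)   # every overlapping bucket is fetched whole
    true_hits = tuples_in(q_lo, q_hi)
    return 1 - (returned - true_hits) / returned

cost = workload_cost(BUCKETS)      # 5*32 + 5*18
p = precision(4, 7, BUCKETS)       # 1 - 20/50
```

Because whole buckets are returned, recall is always 1 and only precision varies with the partition.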
9
Query optimal buckets (QOB)
  • Optimization problem
  • For the uniform workload, find a partition of the
    data into M buckets that minimizes total false
    positives, i.e.

Minimize $\sum_{B=1}^{M} N_B F_B$ (M = 4 in the example)
QOB(1,10,4) = QOB(1,7,3) + Cost(8,10)
  QOB(1,7,3): optimal solution to a sub-problem
  Cost(8,10): cost of the rightmost bucket, $N_B F_B = 24$
[Figure: histogram of Frequency vs. Salary (100Ks), N = 10 (domain size), with the rightmost bucket covering values 8 to 10.]
10
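QOB (cont.)
Optimal cost $= \sum_{B=1}^{4} N_B F_B = 3 \cdot 12 + 2 \cdot 20 + 2 \cdot 10 + 3 \cdot 8 = 120$
[Figure: histogram of Frequency vs. Salary (100Ks) partitioned into the four optimal buckets B1, B2, B3, B4.]
Time complexity: $O(n^2 M)$; space: $O(nM)$, where n = number of
distinct values in the dataset and M = number of buckets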
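The recurrence QOB(1, j, k) = min over i of QOB(1, i, k-1) + Cost(i+1, j) lends itself to a textbook dynamic program. The sketch below is one way to write it, not necessarily the authors' implementation; the function name `qob` and the frequency ordering in `FREQ` are assumptions (the ordering is chosen to agree with the bucket totals shown on the slides).

```python
def qob(freq, m):
    """Split freq (frequencies of n consecutive values) into m contiguous
    buckets minimising sum(N_B * F_B). Returns (cost, buckets), where
    buckets are 1-based inclusive (first, last) value positions.
    Runs in O(n^2 * m) time and O(n * m) space; assumes m <= len(freq)."""
    n = len(freq)
    prefix = [0]
    for f in freq:
        prefix.append(prefix[-1] + f)

    def cost(i, j):
        # Bucket covering positions i+1..j: N_B = j - i, F_B = prefix[j] - prefix[i].
        return (j - i) * (prefix[j] - prefix[i])

    inf = float("inf")
    best = [[inf] * (n + 1) for _ in range(m + 1)]   # best[k][j]: first j values, k buckets
    split = [[0] * (n + 1) for _ in range(m + 1)]    # start of the rightmost bucket
    best[0][0] = 0
    for k in range(1, m + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                c = best[k - 1][i] + cost(i, j)
                if c < best[k][j]:
                    best[k][j] = c
                    split[k][j] = i

    # Walk the split table backwards to recover the bucket boundaries.
    buckets, j = [], n
    for k in range(m, 0, -1):
        i = split[k][j]
        buckets.append((i + 1, j))
        j = i
    buckets.reverse()
    return best[m][n], buckets

# Assumed ordering of the histogram bars for salary values 1..10.
FREQ = [4, 4, 4, 10, 10, 6, 4, 4, 2, 2]
optimal_cost, optimal_buckets = qob(FREQ, 4)
```

On this assumed histogram the rightmost bucket is (8, 10) with cost 3 * 8 = 24, matching Cost(8,10) on the previous slide.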
11
Outline
  • Optimal data partitioning for range queries
  • Adversarial goals & privacy measures
  • Balancing privacy and precision
  • Experiments & conclusion

12
Adversary's learning model
  • A needs to learn bucket properties to estimate
    sensitive values
  • Model: A learns the distribution of values in
    buckets from A's domain knowledge and from
    sample values from buckets
  • Worst-case assumption for privacy analysis:
    A knows the exact value distribution for every bucket

13
Adversarial Goal (I)
  • Individual-centric information
  • E.g., what is the salary of an individual I?
  • Value Estimation Power (VEP) of A
  • Variance of the bucket distribution is an inverse
    measure of VEP

[Figure: average error of value estimation for the adversary. Large bucket range, large variance (preferred) vs. small bucket range, small variance.]
14
Adversarial Goal (II)
  • Query-centric information
  • E.g., which individuals have salary ∈ [100K, 150K]?
  • Set Estimation Power (SEP) of A
  • Entropy of the bucket distribution is an inverse
    measure of SEP

$H(X) = -\sum_i p_i \log p_i$
[Figure: average error of query-set estimation for the adversary, for a query range of 100k to 150k inside a bucket. Best case: high entropy, large variance; contrasted with low entropy, large variance.]
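Both privacy measures are cheap to compute per bucket. The following Python sketch shows one way to do it; the sample buckets are made-up values, and the base-2 logarithm is an assumption, since the slide does not fix the base.

```python
import math
from collections import Counter

def bucket_variance(values):
    """Population variance of the values in a bucket (inverse measure of VEP)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n

def bucket_entropy(values):
    """Entropy H(X) = -sum(p_i * log2(p_i)) of the bucket's value
    distribution (inverse measure of SEP)."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

# Hypothetical buckets (salaries in K): a narrow, skewed one and a wide, flat one.
narrow = [100, 100, 100, 101]
wide = [100, 150, 200, 250]

narrow_stats = (bucket_variance(narrow), bucket_entropy(narrow))
wide_stats = (bucket_variance(wide), bucket_entropy(wide))
```

The wide bucket scores higher on both measures, so an adversary who learns its distribution can estimate neither individual values nor query sets as accurately.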
15
Outline
  • Optimal data partitioning for range queries
  • Adversarial goals & privacy measures
  • Balancing privacy and precision
  • Experiments & conclusion

16
Privacy-Precision Trade-off
  • Optimal buckets might offer less privacy than
    desired
  • Small variance ⇒ partial disclosure of numeric
    value
  • Small entropy ⇒ total disclosure with high
    probability (e.g. categorical data)
  • Partial detection of query-sets (for all cases)
  • Objective: an algorithm that allows trading off a
    bounded amount of query precision for greater
    variance and entropy

17
The controlled diffusion algorithm
  • A simple observation
  • Let a query Q overlap only with B0
  • If elements of B0 are distributed
    into CB1, CB2 & CB3 randomly
  • Now Q overlaps with CB1, CB2 & CB3
  • With new buckets, the precision for Q drops by a
    factor of $(|CB_1| + |CB_2| + |CB_3|) / |B_0|$
  • Any re-distribution scheme where, $\forall B_i$, this
    ratio $\le K$ $\Rightarrow$ precision degradation is bounded
    above by K

[Figure: query Q over bucket B0, whose elements are diffused into composite buckets CB1, CB2, CB3.]
18
Controlled diffusion Algorithm
  • Compute optimal buckets on data set D → {B1, ..., BM}
  • Fix max degradation factor K
  • Initialize M empty composite buckets → {CB1, ..., CBM}
  • Set target size of each CB to
    $f_{CB} = |D|/M$ (equidepth)
  • $\forall B_i$: select $d_i$ CBs at random, where
    $d_i = K|B_i|/f_{CB}$
  • Diffuse elements of $B_i$ into these uniformly at
    random
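The steps above can be sketched in Python as follows. This is a simplified illustration, not the authors' code: the slide does not say how to make d_i an integer, so rounding and clamping to [1, M] is an assumption, and the sketch does not enforce the equidepth target size on the composite buckets. The function name and the `seed` parameter are hypothetical.

```python
import random

def controlled_diffusion(optimal_buckets, k, seed=None):
    """Diffuse each optimal bucket B_i into d_i randomly chosen composite
    buckets, with d_i = K * |B_i| / f_CB and f_CB = |D| / M.
    Returns (composite_buckets, chosen), where chosen[i] lists the
    composite-bucket indices that received elements of B_i."""
    rng = random.Random(seed)
    m = len(optimal_buckets)
    total = sum(len(b) for b in optimal_buckets)
    f_cb = total / m                                  # equidepth target size
    composite = [[] for _ in range(m)]
    chosen = []
    for bucket in optimal_buckets:
        d = round(k * len(bucket) / f_cb)             # assumption: round to nearest
        d = max(1, min(m, d))                         # assumption: clamp to [1, M]
        targets = rng.sample(range(m), d)             # d distinct CBs at random
        chosen.append(targets)
        for element in bucket:
            composite[rng.choice(targets)].append(element)
    return composite, chosen

# Hypothetical input: four optimal buckets of sizes 12, 20, 10 and 8 over values 1..10.
B = [[1] * 4 + [2] * 4 + [3] * 4,
     [4] * 10 + [5] * 10,
     [6] * 6 + [7] * 4,
     [8] * 4 + [9] * 2 + [10] * 2]
CB, chosen = controlled_diffusion(B, k=2, seed=0)
```

With K = 2 and f_CB = 12.5, the four buckets are spread over 2, 3, 2 and 1 composite buckets respectively, so the client must keep track of up to K * M bucket-to-CB assignments.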

19
Controlled Diffusion (Example)
Degradation factor K = 2
Metadata size increases from O(M) to O(KM)
[Figure: the query optimal buckets B1, B2, B3, B4 over the frequency histogram (values 1 to 10), and the final set of composite buckets CB1, CB2, CB3, CB4 stored on the server.]
20
Some features of the diffusion algorithm
  • Many consecutive optimal buckets might get
    diffused into a common set of CBs ⇒
    observed precision degradation < K
  • Elements with same values can go to multiple
    buckets ⇒ giving it an extra degree of freedom
    compared to hashing
  • Not best for point queries
  • Random choice in the algorithm ⇒
    each bucket distribution approaches the data
    distribution as K increases ⇒ reducing
    information gained by the adversary by learning
    buckets

21
Outline
  • Optimal data partitioning for range queries
  • Adversarial goals & privacy measures
  • Balancing privacy and precision
  • Experiments & conclusion

22
Experiments
  • Data sets
  • Synthetic data: $10^5$ integers in [0, 999],
    uniformly at random
  • Real data: $10^4$ real values in [-0.8, 8.0], Corel
    Image dataset (UCI KDD archive)
  • Query workloads (2, of size $10^4$ each)
  • End points chosen uniformly at random from the
    respective ranges

23
  1. Relative decrease in precision of composite
    buckets
  2. Relative increase in standard deviation in
    composite buckets
  3. Relative increase in entropy in composite buckets

24
Composite buckets (sample)
K = 6, M = 350
K = 10, M = 250
25
  • Visualizing trade-offs for various bucketization
    parameters
  • E.g., the marked points show the average entropy &
    precision we get for 100 buckets & a degradation
    factor of 2
  • The same point in the precision vs. standard
    deviation trade-off space
  • Provides an easy way to visualize the design
    space and choose parameters of interest

26
Summary
  • An optimal algorithm for partitioning data for
    range queries
  • Statistical measures of data privacy: variance
    and entropy
  • Fast & simple algorithm for re-bucketizing data:
    bounded amount of precision degradation,
    substantial increase in privacy level

27
Related work
  • Hacigumus et al., SIGMOD 2002, Executing SQL
    over Encrypted Data in the Database-Service-Provider
    Model.
  • Damiani et al., ACM CCS 2003, Balancing
    Confidentiality and Efficiency in Untrusted
    Relational DBMSs.
  • Bouganim et al., VLDB 2002, Chip-Secured Data
    Access: Confidential Data on Untrusted Servers.

28
THANK YOU !
Questions ?
29
Privacy in DAS
  • Here the goal of data privacy is not just ensuring
    non-disclosure of identity. It is more general!

DAS
  • Privacy criteria: hide as much information as
    possible (even at the aggregate level)
  • Utility criteria: maintain only the necessary
    information required for server-side query
    evaluation (at the desired degree of accuracy)
Privacy-preserving DM & Statistical DB
  • Privacy criteria: protect against disclosure of
    identity
  • Utility criteria: minimize information loss,
    i.e. maximize utility for data miners, retain as
    much aggregate-level information as possible

30
Individual Privacy Measure
  • Average Squared Error of Estimation (ASEE)
  • Error in approximating the true value of a r.v. $X_B$ by
    another r.v. $X'_B$ (learned by A)
  • $ASEE(X_B, X'_B) = Var(X_B) + Var(X'_B) + (E(X_B) - E(X'_B))^2$
  • Variance of the bucket distribution, $Var(X_B)$, is our
    measure of individual privacy (lower bound)
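
The ASEE identity and its lower bound follow from a short expansion. The derivation below assumes that ASEE is the expected squared difference $E((X_B - X'_B)^2)$ and that $X_B$ and $X'_B$ are independent; the slide states neither explicitly.

```latex
\begin{align*}
ASEE(X_B, X'_B)
  &= E\big((X_B - X'_B)^2\big) \\
  &= E(X_B^2) - 2\,E(X_B)\,E(X'_B) + E({X'_B}^2) \\
  &= Var(X_B) + E(X_B)^2 - 2\,E(X_B)\,E(X'_B) + Var(X'_B) + E(X'_B)^2 \\
  &= Var(X_B) + Var(X'_B) + \big(E(X_B) - E(X'_B)\big)^2 \\
  &\ge Var(X_B)
\end{align*}
```

The last two terms are non-negative, so $Var(X_B)$ bounds the adversary's error from below, with equality when A's estimate is the constant $E(X_B)$.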

31
Set oriented Privacy Measure
  • Entropy of the bucket distribution is our measure
    for query-centric privacy
  • Measures the uncertainty associated with a r.v. (e.g.
    the true class of an element for categorical data)
  • An inverse measure of the quality of partial
    solution sets that A can derive for a query

$H(X) = -\sum_i p_i \log p_i$
32
Meta data size increase in diffusion
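  • The meta data increases from O(M) to
  • $K|B_1|/f_{CB} + K|B_2|/f_{CB} + \dots + K|B_M|/f_{CB}$
  • $= (K/f_{CB})\,(|B_1| + |B_2| + \dots + |B_M|)$
  • $= (KM/|D|)\,|D| = KM = O(KM)$, using $f_{CB} = |D|/M$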