Title: Order Preserving Encryption for Numeric Data Rakesh Agrawal Jerry Kiernan Ramakrishnan Srikant Yirong Xu IBM Almaden Research Center
1Order Preserving Encryption for Numeric Data
Rakesh Agrawal, Jerry Kiernan, Ramakrishnan Srikant, Yirong Xu
IBM Almaden Research Center
2Outline
- Motivation and Introduction
- OPES encryption
- Modeling the distribution
- Experimental evaluation
3Motivation
- Encryption is rapidly becoming a requirement in a myriad of business settings (e.g., health care, financial, retail, government), driven by legislation (e.g., SB1386, HIPAA)
- Encrypting databases unleashes a host of problems
- Performance slowdown
- Incompatibility with standard database features
- E.g., comparison predicates and the use of indexes
- Changes to applications for encryption
- Encryption functions now appear in queries
4Order Preserving Encryption Function
$E$ is an order preserving encryption function, $p_1$ and $p_2$ are two plaintext values, and $c_1 = E(p_1)$, $c_2 = E(p_2)$:
if ($p_1 < p_2$) then ($c_1 < c_2$)
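The property can be checked mechanically over a set of values. A minimal sketch, assuming a toy strictly increasing function `toy_encrypt` as a stand-in for E (it is hypothetical and is not the OPES algorithm):

```python
from itertools import combinations

def toy_encrypt(p: int) -> int:
    """Hypothetical stand-in for E: strictly increasing for p >= 0."""
    return 3 * p * p + 7 * p + 11

def is_order_preserving(encrypt, plaintexts) -> bool:
    """Check that p1 < p2 implies E(p1) < E(p2) for every pair of values."""
    values = sorted(set(plaintexts))
    # combinations() of a sorted list yields pairs (a, b) with a < b.
    return all(encrypt(a) < encrypt(b) for a, b in combinations(values, 2))

salaries = [28000, 35000, 100000, 42000]
print(is_order_preserving(toy_encrypt, salaries))
```

Any strictly increasing function passes this check; what distinguishes OPES is the control it adds over the ciphertext distribution.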
5Threat Model
- The storage system used by the DBMS is untrusted, i.e., vulnerable to compromise
- The DBMS software is trusted
- Ciphertext only attack
- The adversary has access to all (but only) encrypted values
- Guard against percentile exposure
- An adversary should not be able to get even an estimate of true values
6Design Goals
- Query results from OPES will be sound and complete
- Comparison operations will be performed without decrypting the operands
- Standard database indexes can be used over encrypted data
- Tolerate updates
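The second and third goals can be illustrated with a sorted list playing the role of an ordered index. This is a sketch under stated assumptions: `toy_encrypt` and `toy_decrypt` are hypothetical strictly increasing stand-ins, not OPES, and the salary values are made up.

```python
import bisect

def toy_encrypt(p: int) -> int:
    """Hypothetical strictly increasing stand-in for OPES encryption."""
    return 5 * p + 17

def toy_decrypt(c: int) -> int:
    """Inverse of toy_encrypt."""
    return (c - 17) // 5

salaries = [28000, 35000, 120000, 99000, 150000]
# A sorted list of ciphertexts plays the role of a standard ordered index.
encrypted_index = sorted(toy_encrypt(s) for s in salaries)

def range_query_gt(index, threshold_plain):
    """Return ciphertexts greater than E(threshold); stored values stay encrypted."""
    c = toy_encrypt(threshold_plain)
    start = bisect.bisect_right(index, c)
    return index[start:]

hits = range_query_gt(encrypted_index, 100000)
print([toy_decrypt(c) for c in hits])  # decrypt only the final result
```

Only the query constant is encrypted and only the result rows are decrypted; the comparison itself runs over ciphertexts.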
7Integration of Encryption and Query Processing
Users have a plaintext view of an encrypted
database
We hereafter strictly focus on the OPES algorithms
Comparison operators are directly applied over
encrypted columns
Queries
Plaintext queries are translated into equivalent
queries over encrypted data
Select name from Emp where sal > 100000
Translation layer
Select decrypt (xsxx) from cwlxss where xescs > OPESencrypt(100000)
DBMS
Tables are encrypted using standard as well as
order preserving encryption
Encrypted data and metadata
8Outline
- Motivation and Introduction
- OPES encryption
- Modeling the distribution
- Experimental evaluation
9Approach
- Plaintext data has unknown distribution
- User selects the target (ciphertext) distribution
- Ciphertext values exhibit the target distribution
10Effect of OPES Encryption on Plaintext
Distributions
Input: Gaussian, Target: Zipf
Input: Uniform, Target: Zipf
11OPES Key Generation
Sample of source values from the plaintext
distribution
Sample of target values from the ciphertext
distribution
OPES Key Generation
OPES Key
12OPES Keys
[Figure: two key components, "Source to uniform" (Source mapped to Uniform) and "Target to uniform" (Target mapped to Uniform)]
13Two Step Encryption
- Source (plaintext) to uniform
- Uniform to target (ciphertext)
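The two steps can be sketched with empirical CDFs. This is a simplified illustration, not the bucketed spline model described later in the deck: `make_cdf`, `interp`, the sample sizes, and the choice of an exponential sample as a skewed target are all assumptions made for the example.

```python
import bisect
import random

def make_cdf(sample):
    """Piecewise-linear empirical CDF over the distinct sorted sample values."""
    xs = sorted(set(sample))
    us = [i / (len(xs) - 1) for i in range(len(xs))]
    return xs, us

def interp(x, xs, ys):
    """Linear interpolation on strictly increasing knots, clamped to the knot range."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    i = bisect.bisect_right(xs, x) - 1
    t = (x - xs[i]) / (xs[i + 1] - xs[i])
    return ys[i] + t * (ys[i + 1] - ys[i])

def encrypt(p, source_cdf, target_cdf):
    u = interp(p, *source_cdf)      # Step I: source (plaintext) to uniform
    txs, tus = target_cdf
    return interp(u, tus, txs)      # Step II: uniform to target (inverse CDF)

def decrypt(c, source_cdf, target_cdf):
    u = interp(c, *target_cdf)      # undo Step II: target to uniform
    sxs, sus = source_cdf
    return interp(u, sus, sxs)      # undo Step I: uniform to source

rng = random.Random(0)
source_cdf = make_cdf([rng.gauss(50, 10) for _ in range(1000)])
target_cdf = make_cdf([rng.expovariate(1.0) * 1000 for _ in range(1000)])

plain = source_cdf[0][100:900:50]
cipher = [encrypt(p, source_cdf, target_cdf) for p in plain]
```

Because both steps are increasing, their composition preserves order for values inside the sampled range; values outside it are clamped in this sketch, which a real scheme would have to handle differently.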
14OPES Encryption
[Figure: Encrypt applies Step I (Source to Uniform) and then Step II (Uniform to Target); Decrypt applies the same steps in reverse]
15Outline
- Motivation and Introduction
- OPES encryption
- Modeling the distribution
- Experimental evaluation
16Modeling the Distribution
- Histograms
- Equi-depth, equi-width, wavelets
- Number of buckets required is unreasonably large
- Overfitting the model
- Parametric
- Poor estimation for irregular distributions
- Hybrid [König and Weikum 99]
- Query result size estimation
- Approach
- Partition the data into buckets
- Model the distribution within a bucket as a spline
- Fixed number of buckets
17Our Approach
- Hybrid [König and Weikum 99]
- Partition the data into buckets
- Model the distribution within each bucket as a linear spline
- The number of buckets is not fixed
- We use MDL (Minimum Description Length) to determine the number of bucket boundaries
18MDL
- The best model for encoding data minimizes the sum of the cost of
- Describing the model
- Describing data in terms of the model
19Model Costs
- Data Cost
- Using a mapping $M$ from $[p_l, p_h)$ to $[f_l, f_h)$, the cost of encoding $p_i$ is
- $C(p_i) = \log(f_i - E(i))$
- $DC(p_l, p_h) = C(p_l) + C(p_{l+1}) + \cdots + C(p_{h-1})$
- Incremental Model Cost
- Fixed cost for each additional bucket
- Boundary value
- Boundary parameters
- Slope
- Scale factor
20Computing Boundaries
- Growth phase
- $[p_l, p_h)$ with $h - l - 1$ sorted points $p_{l+1}, p_{l+2}, \ldots, p_{h-1}$
- Compute spline for $[p_l, p_h)$
- Compute $[f_l, f_h)$ using the spline
- Find further split point $p_s$ with $f_s$ having the maximum deviation from the expected value
- Prune phase
- $LB(p_l, p_h) = DC(p_l, p_h) - DC(p_l, p_s) - DC(p_s, p_h) - IMC$
- $GB(p_l, p_h) = LB(p_l, p_h) + GB(p_l, p_s) + GB(p_s, p_h)$
- if ($GB > 0$), the split at $p_s$ is retained
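The growth and prune phases can be sketched as one recursion. This is a simplified illustration under several assumptions that are not from the slides: the spline fit is skipped and the raw sorted values are used as the $f_i$; the cost of a point is taken as `log2(1 + |deviation|)` to avoid a logarithm of zero; `IMC` is an arbitrary constant; and a pruned bucket is assumed to contribute zero benefit to its parent.

```python
import math

IMC = 8.0  # assumed fixed cost (in bits) of one additional bucket; hypothetical value

def expected(points, l, h, i):
    """E(i): position of the i-th point if points were evenly spread over the bucket."""
    return points[l] + (i - l) / (h - l) * (points[h] - points[l])

def data_cost(points, l, h):
    """DC(p_l, p_h) = C(p_l) + ... + C(p_{h-1}), with C = log2(1 + |f_i - E(i)|)."""
    return sum(math.log2(1 + abs(points[i] - expected(points, l, h, i)))
               for i in range(l, h))

def find_boundaries(points, l, h, min_points=3):
    """Return (GB, retained split indexes) for the sorted points[l..h]."""
    if h - l - 1 < min_points:
        return 0.0, []
    # Growth: split at the interior point deviating most from its expected position.
    s = max(range(l + 1, h),
            key=lambda i: abs(points[i] - expected(points, l, h, i)))
    gb_left, left = find_boundaries(points, l, s, min_points)
    gb_right, right = find_boundaries(points, s, h, min_points)
    # Prune: local benefit of this split plus the global benefit of the sub-splits.
    lb = data_cost(points, l, h) - data_cost(points, l, s) - data_cost(points, s, h) - IMC
    gb = lb + gb_left + gb_right
    if gb > 0:
        return gb, left + [s] + right
    return 0.0, []  # assumption: a pruned bucket passes no benefit upward

evenly_spaced = list(range(100))
two_clusters = list(range(50)) + list(range(1000, 1050))
```

For evenly spaced input no split pays for its model cost, so no boundary is retained; for the two-cluster input the retained boundaries sit at the gap between the clusters.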
21Scaling
Number of values in a bucket may be
disproportional to the size of the bucket
[Figure: values (x) in adjacent buckets b-1, b, b+1, shown in the Source space and in the Uniform space]
22Updates
- The scale factor ensures that each distinct plaintext value maps to a distinct ciphertext value
- Encrypted values need not be recomputed unless the distribution of plaintext values changes
23Quality of Encryption
- Kolmogorov-Smirnov (KS) statistical test
- Can we disprove, to a certain required level of significance, the null hypothesis that two data sets are drawn from the same distribution function?
- If not, then the ciphertext distribution cannot be distinguished from the specified target distribution
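A two-sample KS test can be sketched in a few lines. The example below is an illustration only: it uses the standard asymptotic critical value $c(\alpha)\sqrt{(n+m)/(nm)}$ with $c(\alpha) = \sqrt{-\tfrac{1}{2}\ln(\alpha/2)}$, and the function names and the default significance level are assumptions, not details taken from the slides.

```python
import math

def ks_statistic(a, b):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance past every copy of x in both samples before comparing CDFs.
        while i < len(a) and a[i] == x:
            i += 1
        while j < len(b) and b[j] == x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

def rejects_same_distribution(a, b, alpha=0.05):
    """Reject the null hypothesis if D exceeds the asymptotic critical value."""
    n, m = len(a), len(b)
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2))
    return ks_statistic(a, b) > c_alpha * math.sqrt((n + m) / (n * m))
```

In the setting of this slide, `a` would be the encrypted values and `b` a sample drawn from the chosen target distribution; failing to reject is the desired outcome.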
24Duplicates
- Assumptions
- A large number of duplicates may leak information about the distribution of values
- Alternatively,
- Map duplicates to distinct values
- if $f = M(p)$ and $f' = M(p+1)$, map $p$ to a value in $[f, f')$
- Equality expressed as a range
- Equi-joins can no longer be expressed
- However, many numeric attributes (e.g., salary) may rarely be used in joins
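Mapping duplicates into a range, and rewriting equality as a range predicate, might look like the sketch below. `toy_map` is a hypothetical strictly increasing M with wide gaps between consecutive values, and picking the value uniformly at random inside the range is an assumption of this example.

```python
import random

def toy_map(p: int) -> int:
    """Hypothetical strictly increasing M with a gap between consecutive values."""
    return 1000 * p

def encrypt_with_duplicates(p: int, rng: random.Random) -> int:
    """Map p to some value in [M(p), M(p+1)), so equal plaintexts usually differ."""
    f, f_next = toy_map(p), toy_map(p + 1)
    return rng.randrange(f, f_next)

def equality_as_range(p: int):
    """The plaintext predicate `x = p` becomes `f <= c < f'` over ciphertexts."""
    return toy_map(p), toy_map(p + 1)

rng = random.Random(1)
ciphertexts = [encrypt_with_duplicates(35000, rng) for _ in range(5)]
low, high = equality_as_range(35000)
```

Order is still preserved across different plaintexts because the ranges for p and p+1 do not overlap, but two encrypted columns can no longer be matched with a plain equality join.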
25Outline
- Motivation and Introduction
- OPES encryption
- Modeling the distribution
- Experimental evaluation
26Experimental Evaluation
- Percentile exposure
- Updatability
- Key size
- Time overhead
27Datasets
- Census
- UCI KDD archive, PUMS census data (30,000 records)
- Gaussian
- Zipf
- Uniform
Default: Source Gaussian, Target Zipf
28Percentile Exposure
Source distribution   Target distribution   Average change in percentile
Census                Gaussian              37
Census                Zipf                  7
Census                Uniform               38
Gaussian              Zipf                  45
Gaussian              Uniform               17
Zipf                  Uniform               44
29Time to Build the Model
30Insertion Overhead
31Cost of Additional Insertion
32Retrieval Overhead
33Retrieval Time
34Related Work
- Polynomial functions
- Ignores the distribution of plaintext/ciphertext values
- Database as a service
- Requires post processing of query results
- Privacy homomorphisms
- Comparison operations not investigated
- Keyword searches on encrypted data
- Designed for keyword retrieval
- Range queries not supported
- Smartcard-based schemes
- Infeasible for large ranges
- Order-preserving hashing
- Protecting the hash values from cryptanalysis is not a concern, nor is deciphering plaintext values from hash values
- Designed for static collections
35Closing Remarks
- Ensuring safety without impeding the flow of information is a hard problem
- Current choices
- Plaintext database
- Encrypted databases with loss of functionality or performance
- Our approach focused on the trade-off between security and efficiency
- We developed an algorithm which could easily be integrated with current systems
36Backup
37Encode
$\mathrm{Encode}(p) = z(sp^2 + p)$, $p \in [0, p_h)$, $s = q/(2r)$, $z > 0$
- The distribution has density function $qp + r$
- $p$ is the source (target) value
- $s$ is the quadratic coefficient
- $z$ is the scale factor
38Decode
$\mathrm{Decode}(f) = \dfrac{-z \pm \sqrt{z^2 + 4zsf}}{2zs}$, $f \in [0, f_h)$, $s = q/(2r)$, $z > 0$
- $f$ is the flattened value
- $s$ is the quadratic coefficient
- $z$ is the scale factor
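The encode and decode formulas are inverses of each other, which a short round trip can confirm. In this sketch the "+" root of the quadratic is taken, which is the branch that recovers p when the encoding is increasing (that is, when $2sp + 1 > 0$); that choice, the handling of $s = 0$, and the parameter values are assumptions made for the example.

```python
import math

def encode(p: float, s: float, z: float) -> float:
    """Encode(p) = z * (s * p^2 + p)."""
    return z * (s * p * p + p)

def decode(f: float, s: float, z: float) -> float:
    """Invert encode on the branch where it is increasing (2*s*p + 1 > 0)."""
    if s == 0:
        return f / z  # the quadratic degenerates to a linear map
    return (-z + math.sqrt(z * z + 4 * z * s * f)) / (2 * z * s)

# Round trip with a positive and a negative quadratic coefficient.
for s, z, p_max in [(0.01, 3.0, 40), (-0.005, 2.0, 50)]:
    for p in range(p_max):
        assert abs(decode(encode(p, s, z), s, z) - p) < 1e-9
```

For negative s the round trip only holds while $p < -1/(2s)$, which is why the second test range stops well below 100.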
39Order Preserving Encryption
Ciphertext is the index value
- Effectively hides the distribution of plaintext values
- The key size is proportional to the number of distinct attribute values
- Any updates require recomputing the key and ciphertext values
[Example table columns: No, Name, Position, Salary, Location]
Ciphertext   Plaintext
1            28000
2            35000
Cn           Pn
Compute distinct attribute values in ascending order
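This index-based scheme is small enough to sketch directly. The function names are hypothetical; the example also shows why an update forces the key and existing ciphertexts to be recomputed.

```python
import bisect

def build_key(values):
    """Key: the distinct plaintext values in ascending order."""
    return sorted(set(values))

def encrypt(key, p):
    """Ciphertext is the 1-based index of p in the key."""
    i = bisect.bisect_left(key, p)
    if i == len(key) or key[i] != p:
        raise KeyError(p)
    return i + 1

def decrypt(key, c):
    return key[c - 1]

key = build_key([35000, 28000, 35000])
before = encrypt(key, 35000)

# Inserting a new value between existing ones shifts later ciphertexts.
new_key = build_key([35000, 28000, 35000, 30000])
after = encrypt(new_key, 35000)
```

Here the ciphertext of 35000 changes once 30000 is inserted, so every stored ciphertext above the insertion point would have to be rewritten.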
40Target Distribution Requirement
- Why isn't the source-to-uniform transformation sufficient for order preserving encryption?
- It is, but
- The target distribution may cause an adversary to make incorrect assumptions about the source distribution
- The organization of the source distribution cannot be inferred from the target
41Quadratic Coefficient
[Figure: values (x) plotted against v in buckets b1 and b2, with index pairs (i1, j1) and (i2, j2); formulas for q and s in terms of v_i1, v_j1, v_i2, v_j2, v_b1, v_b2 (not recoverable)]
42Scale Factor Constraints
$\forall p \in [0, w):\; M(p+1) - M(p) \ge 2$
Ensures that there is a distinct mapped value for each input value
$w_f = Kn$
The width of a bucket in the mapped space is a function of the number of elements $n$ in the bucket. $K$ is the minimum width needed across buckets.
43Scale Factor
The scale factor will stretch short buckets to the width of the largest bucket, further increasing the dimension of a bucket by a factor of the number of elements in the bucket
$z = \dfrac{Kn}{sw^2 + w}$
$K = \max_i \left[ x_i (s_i w_i^2 + w_i) \right], \; i = 1, \ldots, m$
$x = 2$ if $s \ge 0$; $x = 2 / (1 + s(2w - 1))$ if $s < 0$
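The scale factor computation can be sketched as follows, reading the formulas as $z = Kn/(sw^2 + w)$ with $K$ the largest per-bucket value of $x(sw^2 + w)$. The bucket parameters below are made up, and the code assumes $sw^2 + w > 0$ and $1 + s(2w - 1) > 0$ for every bucket.

```python
def min_scale(s: float, w: int) -> float:
    """Per-bucket factor x from the two cases: 2 if s >= 0, else 2 / (1 + s(2w - 1))."""
    return 2.0 if s >= 0 else 2.0 / (1.0 + s * (2 * w - 1))

def scale_factors(buckets):
    """buckets: list of (s, w, n) = quadratic coefficient, width, element count.
    Returns K and the scale factor z = K*n / (s*w^2 + w) of each bucket."""
    K = max(min_scale(s, w) * (s * w * w + w) for s, w, n in buckets)
    return K, [K * n / (s * w * w + w) for s, w, n in buckets]

def mapped(p: int, s: float, z: float) -> float:
    """M(p) = z * (s * p^2 + p)."""
    return z * (s * p * p + p)

buckets = [(0.01, 10, 5), (-0.02, 20, 8), (0.0, 4, 2)]  # hypothetical buckets
K, zs = scale_factors(buckets)
```

With these factors each bucket's mapped width equals $Kn$, and consecutive integer inputs stay at least 2 apart in the mapped space.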
44Slope
The values within a single bucket are unevenly
distributed within the bucket
[Figure: buckets b-1 and b]