Title: Cryptographic methods for privacy aware computing: applications
1 Cryptographic methods for privacy-aware computing: applications
2 Outline
- Review three basic methods
- Two applications
  - Distributed decision tree with horizontally partitioned data
  - Distributed k-means with vertically partitioned data
3 Three basic methods
- 1-out-of-K oblivious transfer
- Random shares
- Homomorphic encryption
- Cost is the major concern
4 Two example protocols
- The basic idea:
  - Do not release the original data
  - Exchange only intermediate results
  - Apply the three basic methods to combine them securely
5 Building decision trees over horizontally partitioned data
- Horizontally partitioned data
- Entropy-based information gain
- Major ideas in the protocol
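6 Horizontally Partitioned Data
- Table with a key and attributes X1 ... Xd, split by rows across r sites
[Figure: the full table (key; X1 ... Xd) holds records k1, k2, ..., kn. Every site stores all attributes for its own block of records: Site 1 holds k1 ... ki, Site 2 holds k(i+1) ... kj, ..., Site r holds k(m+1) ... kn.]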
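7 Review decision tree algorithm (ID3 algorithm)
- Find the cut that maximizes gain
  - A certain attribute Ai, with sorted values v1 ... vn
  - A certain value vi in that attribute
- For categorical data we use the test Ai = vi
- For numerical data we use the test Ai < vi
[Figure: a decision tree whose root tests "Ai < vi?" with yes/no branches, followed by a further test "Aj < vj?". Beside it, the sorted values v1, v2, ..., vn of attribute Ai are listed with their labels l1, l2, ..., ln, and a cut splits the list. E() denotes the entropy of the label distribution.]
Choose the attribute/value that gives the highest gain!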
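The cut selection on this slide can be sketched in a few lines of Python. This is a minimal, non-private illustration, assuming a single numerical attribute and binary tests of the form Ai < vi. The function names `entropy`, `information_gain` and `best_cut` and the sample data are hypothetical, not taken from the slides.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of the label distribution of a list of labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels, cut):
    """Gain of splitting the records with the test `value < cut`."""
    left = [l for v, l in zip(values, labels) if v < cut]
    right = [l for v, l in zip(values, labels) if v >= cut]
    n = len(labels)
    remainder = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - remainder

def best_cut(values, labels):
    """Try every distinct value as a cut; return the (cut, gain) pair with the highest gain."""
    candidates = ((v, information_gain(values, labels, v)) for v in sorted(set(values)))
    return max(candidates, key=lambda pair: pair[1])

# Hypothetical attribute values and class labels.
values = [1, 2, 3, 4, 5, 6]
labels = ["a", "a", "a", "b", "b", "b"]
cut, gain = best_cut(values, labels)
```

With this toy data the test `value < 4` separates the two classes perfectly, so it is the cut that gets selected.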
8 Key points
[Figure: the sorted values v1, v2, ..., vn of attribute Ai with their labels l1, l2, ..., ln and a cut.]
- The key is calculating x log x, where x = x1 + x2 is the sum of the values held by the two parties P1 and P2, respectively
- The computation is decomposed into several steps
- After each step, each party knows only a random share of the result
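A short derivation shows why x log x is the only non-trivial quantity needed. As an assumption for illustration (these symbols are not in the slides), let S hold n records, of which n_c carry class label c, and let party P_j hold n_c^{(j)} of them. The natural logarithm is used; another base only rescales the result.

```latex
\begin{align*}
E(S) &= -\sum_{c} \frac{n_c}{n}\,\ln\frac{n_c}{n}
      = \frac{1}{n}\Bigl(n \ln n - \sum_{c} n_c \ln n_c\Bigr),\\
n_c &= n_c^{(1)} + n_c^{(2)}, \qquad n = n^{(1)} + n^{(2)}.
\end{align*}
```

Every term inside the parentheses has the form x ln x with x = x1 + x2, where each party knows only its own count.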
9 Steps
- Step 1: compute random shares w1 and w2 such that w1 + w2 = (x1 + x2) ln(x1 + x2)
  - A major sub-protocol is used to compute ln(x1 + x2)
- Step 2: for a condition (Ai, vi), find the random shares of E(S), E(S1) and E(S2), respectively
- Step 3: repeat steps 1 and 2 for all possible (Ai, vi) pairs
- Step 4: use a circuit to determine which (Ai, vi) pair results in the maximum gain
[Figure: a circuit with inputs labelled w11, w21 (on the x1 side) and w12, w22 (on the x2 side) that outputs the (Ai, vi) pair with maximum gain.]
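The flow of steps 1 and 2 can be illustrated with a small simulation. This sketch is not the secure protocol: a hypothetical helper `split_into_shares` plays the role of a trusted dealer that sees x1 + x2, whereas the protocol produces the shares without either party revealing its input. It also assumes that the total record count n is known to both parties. It only shows that, once random shares of each x ln x term exist, each party can combine its own shares locally and the two results still add up to the entropy.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the run is reproducible

def split_into_shares(value):
    """Stand-in for Step 1: return (w1, w2) with w1 + w2 == value (up to rounding)."""
    w1 = rng.uniform(-1000.0, 1000.0)
    return w1, value - w1

def xlnx(x):
    """x * ln(x), with the usual convention 0 * ln(0) = 0."""
    return x * math.log(x) if x > 0 else 0.0

# Class counts held by each party for the same tree node (hypothetical numbers).
counts_p1 = {"yes": 3, "no": 1}
counts_p2 = {"yes": 2, "no": 4}
n = sum(counts_p1.values()) + sum(counts_p2.values())

# Shares of n ln n and of each n_c ln n_c, where n_c is the joint class count.
total_w1, total_w2 = split_into_shares(xlnx(n))
class_shares = [split_into_shares(xlnx(counts_p1[c] + counts_p2[c])) for c in counts_p1]

# Each party combines only its own shares, using
# E(S) = (n ln n - sum_c n_c ln n_c) / n.
e1 = (total_w1 - sum(w1 for w1, _ in class_shares)) / n
e2 = (total_w2 - sum(w2 for _, w2 in class_shares)) / n

entropy_from_shares = e1 + e2
entropy_direct = -sum(((counts_p1[c] + counts_p2[c]) / n)
                      * math.log((counts_p1[c] + counts_p2[c]) / n) for c in counts_p1)
```

Neither `e1` nor `e2` reveals the entropy on its own; only their sum does, which is why the final comparison in step 4 is done inside a circuit rather than by exchanging the shares.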
10 2. K-means over vertically partitioned data
- Vertically partitioned data
- Normal K-means algorithm
- Apply secure sum and secure comparison among multiple sites in the secure distributed algorithm
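11 Vertically Partitioned Data
- Table with a key and r sets of attributes
[Figure: the full table (key; X1 ... Xi, X(i+1) ... Xj, ..., X(m+1) ... Xd) is split by columns. Every site stores the key plus one attribute set: Site 1 holds X1 ... Xi, Site 2 holds X(i+1) ... Xj, ..., Site r holds X(m+1) ... Xd.]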
12 Motivation
- Naïve approach: send all data to a trusted site and do k-means clustering there
  - Costly
  - Trusted third party?
- Preferable: distributed privacy-preserving k-means
13 Basic K-means algorithm
- 4 main steps
  - Step 1. Randomly select k initial cluster centers (k means)
  - Repeat
    - Step 2. Assign each point i to its closest cluster center
    - Step 3. Recalculate the k means with the new point assignment
  - Until step 4: the k means do not change
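The four steps can be sketched as plain, centralized Python. This is a minimal illustration with no privacy protection. The names `kmeans` and `squared_distance`, the `max_iter` safeguard, the handling of empty clusters and the sample points are assumptions added for the example.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 1: pick k random records as the initial means.
    means = [list(p) for p in rng.sample(points, k)]
    for _ in range(max_iter):
        # Step 2: assign every point to its closest mean.
        assignment = [min(range(k), key=lambda c: squared_distance(p, means[c]))
                      for p in points]
        # Step 3: recompute each mean from the points assigned to it.
        new_means = []
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                new_means.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_means.append(means[c])  # keep the mean of an empty cluster
        # Step 4: stop once the means no longer change.
        if new_means == means:
            break
        means = new_means
    return means, assignment

# Hypothetical 2-D points.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
means, assignment = kmeans(points, k=2)
```

The result depends on the random initial centers, so the sketch fixes a seed to make runs repeatable.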
14 Distributed k-means
- Why can k-means be done over vertically partitioned data?
  - All of the 4 steps are decomposable!
  - The most costly parts (steps 2 and 3) can be done locally
- We will focus on step 2 (assign each point i to its closest cluster center)
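15 Step 1
- All sites share the indices of the k initial random records used as the centroids
[Figure: each centroid µ1 ... µk is split by columns across the sites. For centroid µ1, Site 1 holds µ11 ... µ1i, Site 2 holds µ1(i+1) ... µ1j, ..., Site r holds µ1m ... µ1d, and likewise down to centroid µk with µk1 ... µki, µk(i+1) ... µkj, ..., µkm ... µkd.]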
16 Step 2
- Assign each point X to its closest cluster center
  - 1. Calculate the distance of point X = (X1, X2, ..., Xd) to each cluster center µk
    -- each distance calculation is decomposable!
    $$d^2 = \underbrace{(X_1-\mu_{k1})^2 + \dots + (X_i-\mu_{ki})^2}_{d_1\ \text{(Site 1)}} + \underbrace{(X_{i+1}-\mu_{k(i+1)})^2 + \dots + (X_j-\mu_{kj})^2}_{d_2\ \text{(Site 2)}} + \dots$$
  - 2. Compare the k full distances to find the minimum one
For each X, each site i has a k-element vector holding its partial distances to the k centroids, denoted Xi
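The decomposition can be checked with a small, non-private Python sketch. The helper name `partial_sq_distances`, the attribute split across three sites and all numbers are hypothetical.

```python
def partial_sq_distances(x_part, centroid_parts):
    """One site's k-element vector: squared distance of its attribute slice
    of a point to the matching slice of each of the k centroids."""
    return [sum((a - b) ** 2 for a, b in zip(x_part, mu_part))
            for mu_part in centroid_parts]

# Hypothetical point with d = 5 attributes and k = 2 centroids.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
centroids = [[0.0, 0.0, 0.0, 0.0, 0.0],
             [1.0, 2.0, 3.0, 4.0, 6.0]]

# Vertical partition: site 1 holds attributes 0-1, site 2 holds 2-3, site 3 holds 4.
slices = [slice(0, 2), slice(2, 4), slice(4, 5)]

# Each site computes its vector from its own columns only.
per_site = [partial_sq_distances(x[s], [mu[s] for mu in centroids]) for s in slices]

# Summing the sites' vectors element-wise gives the full squared distances.
full = [sum(site[c] for site in per_site) for c in range(len(centroids))]
closest = min(range(len(centroids)), key=lambda c: full[c])
```

Here the element-wise sum and the final comparison are done in the clear; they are exactly the two operations that have to be protected.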
17 Privacy concerns for step 2
- Some concerns
  - The partial distances d1, d2, ... may breach privacy (they depend on the Xi and µki): need to hide them
  - The distance of a point to each cluster may breach privacy: need to hide it
- Basic ideas to ensure security
  - Disguise the partial distances
  - Compare distances so that only the comparison result is learned
  - Permute the order of clusters so the real meaning of the comparison results is unknown
  - Need 3 non-colluding sites (P1, P2, Pr)
18 Secure Computing of Step 2
- Stage 1: prepare for the secure sum of partial distances
  - P1 generates $V_1 + V_2 + \dots + V_r = 0$, where each Vi is a random k-element vector used to hide the partial distances of site i
  - Use homomorphic encryption to do the randomization: $E_i(X_i)\,E_i(V_i) = E_i(X_i + V_i)$
- Stage 2: calculate the secure sum for r-1 parties
  - P1, P3, P4, ..., Pr-1 send their perturbed and permuted partial distances to Pr
  - Pr sums up the r-1 partial distances (including its own part)
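The effect of stages 1 and 2 can be simulated in the clear. This sketch omits the encryption entirely and only shows the arithmetic: masks that sum to zero, one shared permutation, and a sum that stays meaningless until the last piece is added. All values are hypothetical, with integers used so the check is exact, and site index 1 stands for P2.

```python
import random

rng = random.Random(1)
k = 3  # number of clusters

# Hypothetical partial-distance vectors X_i for r = 4 sites.
X = [[4, 9, 1], [2, 2, 7], [5, 0, 3], [1, 6, 2]]
r = len(X)

# P1 draws random masks V_1..V_{r-1} and sets V_r so that all masks sum to zero.
V = [[rng.randrange(-10**6, 10**6) for _ in range(k)] for _ in range(r - 1)]
V.append([-sum(v[c] for v in V) for c in range(k)])

# P1 also draws a secret permutation pi of the k cluster positions.
pi = list(range(k))
rng.shuffle(pi)

def permute(vec):
    """Return pi(vec): position j of the output holds vec[pi[j]]."""
    return [vec[pi[j]] for j in range(k)]

# Each site ends up with its masked and permuted vector pi(X_i + V_i).
T = [permute([x + v for x, v in zip(X[i], V[i])]) for i in range(r)]

# Stage 2: every site except P2 contributes to the sum held by Pr.
held_by_pr = [sum(T[i][j] for i in range(r) if i != 1) for j in range(k)]
held_by_p2 = T[1]

# The masks cancel only when both pieces are combined.
combined = [a + b for a, b in zip(held_by_pr, held_by_p2)]
true_sum = [sum(x[c] for x in X) for c in range(k)]
```

In this simulation `held_by_pr` and `held_by_p2` are each offset by large random masks, while `combined` equals the permuted vector of true distances. In the protocol that combination is never formed in the clear.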
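19 Secure Computing of Step 2
[Figure: message flow among the sites for Stage 1 and Stage 2.]
- Xi contains the partial distances to the k partial centroids at site i
- $E_i(X_i)\,E_i(V_i) = E_i(X_i + V_i)$: homomorphic encryption, where Ei denotes encryption with the public key of site i
- π(Xi): permutation function, which permutes the order of the elements in Xi
- $V_1 + V_2 + \dots + V_r = 0$: Vi is used to hide the partial distances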
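The homomorphic property can be demonstrated with a toy additively homomorphic scheme. The slides do not name a cryptosystem; Paillier is used here purely as an assumed example, with a tiny key that offers no security. The message flow in the comments (site i encrypts, P1 multiplies in the mask, site i decrypts) is one plausible reading of the slide, not a confirmed detail. The sketch needs Python 3.8 or later for `pow(x, -1, n)`.

```python
import math
import random

rng = random.Random(2)

# Toy key for illustration only: real keys use primes of 1024 bits or more.
p, q = 104729, 1299709
n = p * q
n2 = n * n
lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
mu = pow(lam, -1, n)                               # valid for generator g = n + 1

def encrypt(m):
    """Paillier encryption with g = n + 1: c = (1 + m*n) * s^n mod n^2."""
    while True:
        s = rng.randrange(1, n)
        if math.gcd(s, n) == 1:
            break
    return ((1 + (m % n) * n) * pow(s, n, n2)) % n2

def decrypt(c):
    """Recover m, mapping the upper half of Z_n back to negative integers."""
    m = (((pow(c, lam, n2) - 1) // n) * mu) % n
    return m - n if m > n // 2 else m

X = [4, 9, 1]          # partial distances held by site i (hypothetical)
V = [250, -17, 1000]   # masks chosen by P1 (hypothetical)

# Site i encrypts X_i under its own public key.
enc_X = [encrypt(x) for x in X]
# P1 multiplies ciphertexts, which adds the plaintexts: E(X) * E(V) = E(X + V).
enc_masked = [(cx * encrypt(v)) % n2 for cx, v in zip(enc_X, V)]
# Site i decrypts and obtains X_i + V_i without learning V_i separately.
masked = [decrypt(c) for c in enc_masked]
```

P1 only ever handles ciphertexts of X, and site i only sees the masked sum, which is the point of doing the randomization under encryption.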
20 Secure Computing of Step 2 (continued)
- Stage 3: secure_add_and_compare to find the minimum distance
  - Involves only Pr and P2
  - Use a standard Secure Multiparty Computation protocol to find the result (k-1 comparisons)
- Stage 4
  - The index of the minimum distance (the permuted cluster id) is sent back to P1
  - P1 knows the permutation function and thus knows the original cluster id
  - P1 broadcasts the cluster id to all parties
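Stages 3 and 4 can be traced with a short sketch. The comparison is done in the clear here, whereas the protocol hides the values being compared. The distances, the permutation and the helper name `index_of_minimum` are hypothetical.

```python
# Hypothetical true summed distances to k = 4 clusters and P1's secret permutation.
true_distances = [12, 17, 5, 13]
pi = [2, 0, 3, 1]  # permuted position j holds original cluster pi[j]

# What Pr and P2 jointly hold (in shared form) after stage 2.
permuted = [true_distances[pi[j]] for j in range(len(pi))]

def index_of_minimum(values):
    """Linear scan that uses exactly len(values) - 1 comparisons."""
    best, comparisons = 0, 0
    for j in range(1, len(values)):
        comparisons += 1
        if values[j] < values[best]:
            best = j
    return best, comparisons

# Stage 3: find the position of the minimum among the permuted distances.
permuted_id, comparisons = index_of_minimum(permuted)

# Stage 4: P1 maps the permuted id back to the original cluster id.
cluster_id = pi[permuted_id]
```

Because the scan runs over permuted positions, whoever learns `permuted_id` cannot tell which real cluster won; only P1, which knows `pi`, can.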
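21 Step 3 can also be done locally
- Update the partial means µi locally according to the new cluster assignments.
[Figure: each record, from (X11 ... X1d) to (Xn1 ... Xnd), carries a cluster label (for example Cluster 2 or Cluster k). The columns are split across the sites: Site 1 holds columns 1 ... i, Site 2 holds columns i+1 ... j, ..., Site r holds columns m ... d.]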
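The local update at one site can be sketched as follows. The function name `update_partial_means`, the rule of keeping the old slice for an empty cluster, and the data are assumptions for illustration.

```python
def update_partial_means(local_rows, assignment, k, old_means):
    """Recompute one site's slice of the k means from its own columns only."""
    new_means = []
    for c in range(k):
        members = [row for row, a in zip(local_rows, assignment) if a == c]
        if members:
            new_means.append([sum(col) / len(members) for col in zip(*members)])
        else:
            new_means.append(old_means[c])  # empty cluster: keep the previous slice
    return new_means

# Hypothetical site holding 2 of the d attributes for n = 4 records.
local_rows = [[1.0, 2.0], [3.0, 4.0], [10.0, 10.0], [12.0, 14.0]]
assignment = [0, 0, 1, 1]             # cluster ids broadcast by P1
old_means = [[0.0, 0.0], [0.0, 0.0]]

partial_means = update_partial_means(local_rows, assignment, 2, old_means)
```

No attribute values leave the site in this step: the only shared input is the list of cluster ids.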
22 Extra communication cost
- O(nrk)
  - n: number of records
  - r: number of parties
  - k: number of means
- Also depends on the number of iterations
23 Conclusion
- It is appealing to have cryptographic privacy-preserving protocols
- The cost is the major concern
  - It can be reduced using novel algorithms