Privacy protection in published data using an efficient clustering method

About This Presentation

Title:

Privacy protection in published data using an efficient clustering method

Description:

Qualifying exam presentation by Murshed.

Transcript and Presenter's Notes

1
Privacy protection in published data using an
efficient clustering method
ICT 2010 presentation

Presented by Md. Manzoor Murshed
Friday, October 23, 2015

2
Overview of the Presentation

Introduction
Re-identification of Data
k-anonymity Model
MOKA Algorithm
Experimental results
Future Work
Conclusion
Question?

3
An Abundance of Data

Supermarket scanners
Credit card transactions
Call center records
ATM machines
Web server logs
Customer web site trails
Podcasts
Blogs
Closed caption

Scientific experiments
Sensors, Cameras
Hospital visits
Social Networks
Facebook, Myspace
Twitter
Speech-to-text translation
Email
Education Institute
Travel records

Print, film, optical, and magnetic storage: 5 exabytes (EB) of new information in 2002, doubled in the last three years (How Much Information? 2003, UC Berkeley)
4
Data Holders Publish Sensitive Information to Facilitate Research

Publish information that
Discloses as much statistical information as possible, so that researchers can discover valid, novel, potentially useful, and ultimately understandable patterns in the data.
Preserves the privacy of the individuals contributing the data.

5
Question?

How do you publicly release a database without
compromising individual privacy?

The Wrong Approach
Just leave out any unique identifiers like name
and SSN and hope that this works.
Why?
The triple (DOB, gender, zip code) suffices to uniquely identify at least 87% of US citizens in publicly available databases (1990 U.S. Census summary data).
Moral: Any real privacy guarantee must be proved and established mathematically.

6
Examples of Re-identification Attempts
Health-specific and general examples of re-identification:
AOL search data: Researchers were able to reveal sensitive details of the participants' private lives, such as Social Security numbers, credit card numbers, and addresses, from the anonymized AOL Internet search data, which also contained health-related searches [8].
Chicago homicide database: A large percentage of individuals were easily re-identified by linking the Chicago homicide database with the Social Security Death Index.
Netflix movie recommendations: Several individuals were re-identified from the publicly available anonymized Netflix movie recommendations database by linking their anonymized movie ratings with ratings on a publicly available Internet movie rating web site [9].
Re-identification of the medical record: The Massachusetts governor's sensitive medical records were re-identified by linking the anonymized data of the Group Insurance Commission, which purchases health insurance for state employees, with the voter list for Cambridge [1].
Southern Illinoisan vs. The Department of Public Health: Individuals in a neuroblastoma data set from the Illinois cancer registry were re-identified with very high accuracy [4].
Canadian Adverse Event Database: A 26-year-old student who died after taking a particular drug was re-identified from the publicly released adverse drug reaction database of Health Canada [4].
7
AOL Data Release

AOL anonymously released a list of 21 million web search queries.
User IDs were replaced by random numbers.

8
A Face Is Exposed for AOL Searcher No. 4417749
New York Times, August 9, 2006

No. 4417749 conducted hundreds of searches over a three-month period on topics ranging from "numb fingers" to "60 single men" to "dog that urinates on everything."
And search by search, click by click, the identity of AOL user No. 4417749 became easier to discern. There are queries for "landscapers in Lilburn, Ga," several people with the last name Arnold and "homes sold in shadow lake subdivision gwinnett county georgia."
It did not take much investigating to follow that data trail to Thelma Arnold, a 62-year-old widow who lives in Lilburn, Ga., frequently researches her friends' medical ailments and loves her three dogs. "Those are my searches," she said, after a reporter read part of the list to her.

9
Re-identification of AOL data release
Ms. Arnold says she loves online research, but the disclosure of her searches has left her disillusioned. In response, she plans to drop her AOL subscription. "We all have a right to privacy," she said. "Nobody should have found this all out."
Source: http://data.aolsearchlogs.com
10
Re-identification by linking

NAHDO reported that 37 states have legislative mandates to collect hospital-level data.
The Group Insurance Commission (GIC) is responsible for purchasing health insurance for state employees.

The medical data was considered anonymous, since identifying attributes were removed.
The Governor of Massachusetts was uniquely identified by the attributes ZIP code, birth date, and sex.
Hence, his private medical records were out in the open.
11
Re-identification by linking (Example)
Hospital Patient Data
DOB      Gender  Zipcode  Disease
1/21/76  Male    53715    Heart Disease
4/13/86  Female  53715    Hepatitis
2/28/76  Male    53703    Bronchitis
1/21/76  Male    53703    Broken Arm
4/13/86  Female  53706    Flu
2/28/76  Female  53706    Hang Nail

Voter Registration Data
Name   DOB      Gender  Zipcode
Andre  1/21/76  Male    53715
Beth   1/10/81  Female  55410
Carol  10/1/44  Female  90210
Dan    2/21/84  Male    02174
Ellen  4/19/72  Female  02237

Andre has heart disease!
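
The linking step on this slide can be sketched in a few lines of Python. This is only an illustration using the two tables above; the names `Patient`, `Voter`, and `link` are made up for the example, and the rule "report a match only when exactly one patient shares the quasi-identifier" is an assumption about how an attacker would read the join.

```python
from typing import NamedTuple

class Patient(NamedTuple):
    dob: str
    gender: str
    zipcode: str
    disease: str

class Voter(NamedTuple):
    name: str
    dob: str
    gender: str
    zipcode: str

hospital = [
    Patient("1/21/76", "Male", "53715", "Heart Disease"),
    Patient("4/13/86", "Female", "53715", "Hepatitis"),
    Patient("2/28/76", "Male", "53703", "Bronchitis"),
    Patient("1/21/76", "Male", "53703", "Broken Arm"),
    Patient("4/13/86", "Female", "53706", "Flu"),
    Patient("2/28/76", "Female", "53706", "Hang Nail"),
]
voters = [
    Voter("Andre", "1/21/76", "Male", "53715"),
    Voter("Beth", "1/10/81", "Female", "55410"),
    Voter("Carol", "10/1/44", "Female", "90210"),
    Voter("Dan", "2/21/84", "Male", "02174"),
    Voter("Ellen", "4/19/72", "Female", "02237"),
]

def link(hospital, voters):
    """Join the two tables on (DOB, gender, zipcode).

    Returns (name, disease) pairs for voters whose quasi-identifier
    matches exactly one hospital record (assumed attack rule).
    """
    by_qi = {}
    for p in hospital:
        by_qi.setdefault((p.dob, p.gender, p.zipcode), []).append(p.disease)
    matches = []
    for v in voters:
        diseases = by_qi.get((v.dob, v.gender, v.zipcode), [])
        if len(diseases) == 1:
            matches.append((v.name, diseases[0]))
    return matches

print(link(hospital, voters))  # [('Andre', 'Heart Disease')]
```

No names appear in the hospital table, yet the join recovers one because the three remaining attributes act as a key.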

12
Data Publishing and Data Privacy

Society is experiencing exponential growth in the
number and variety of data collections containing
person-specific information.
This collected information is valuable both in research and business. Data sharing is common.
Publishing the data may put the respondents' privacy at risk.
Objective
Maximize data utility while limiting disclosure
risk to an acceptable level

13
What is Privacy?
"The claim of individuals, groups, or institutions to determine for themselves when, how and to what extent information about them is communicated to others" (Westin, Privacy and Freedom, 1967)
But we need quantifiable notions of privacy.
"... nothing about an individual should be learnable from the database that cannot be learned without access to the database" (T. Dalenius, 1977)
14
Quality versus anonymity
15
Related Works

Statistical Databases
Add noise to the data while maintaining some statistical invariant.
Disadvantage:
Destroys the integrity of the data.
Multi-level Databases
Data is stored at different security classifications, and users have different security clearances.
Restrict the release of lower-classified information to eliminate precise inference.
Disadvantages:
It is impossible to consider every possible attack.
Suppression can drastically reduce the quality of the data.

16
K-Anonymity

Sweeney came up with a formal protection model named k-anonymity.
What is k-anonymity?
A release provides k-anonymity if the information for each person contained in the release cannot be distinguished from that of at least k-1 other individuals whose information also appears in the release.
Example:
Suppose you try to identify a man in a release, but the only information you have is his birth date and gender. If at least k people in the release match that birth date and gender, the release is k-anonymous with respect to those attributes.

17
Example of suppression and generalization
The following database
First   Last    Age  Race
Lynn    Isvik   34   American
John    Siblic  36   Hisp
Erin    Isvik   25   Hisp
John    Deer    40   Hisp
Victor  Clark   50   Cauc
Danial  Clark   40   Cauc

Can be 2-anonymized as follows
First  Last   Age    Race
*      Isvik  20-40  American
John   *      36     Hisp
*      Isvik  20-40  Hisp
John   *      36     Hisp
*      Clark  40-50  Cauc
*      Clark  40-50  Cauc

Rows 1 and 3 are identical, rows 2 and 4 are identical, and rows 5 and 6 are identical.
Suppression replaces individual attribute values with a *.
Generalization replaces individual attribute values with a broader category.
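
The two operations on this slide can be illustrated with a minimal Python sketch. The helper names `generalize_age` and `suppress`, the fixed-width age buckets, and the `*` placeholder are assumptions made for the example; the slide's own ranges (such as 20-40) use different boundaries.

```python
def generalize_age(age, width=10):
    """Generalization: replace an exact age with a range of `width` years."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def suppress(_value):
    """Suppression: hide the value entirely behind a placeholder."""
    return "*"

# Two records from the example table (first name, last name, age).
records = [("Lynn", "Isvik", 34), ("Erin", "Isvik", 25)]

# Suppress the first name and generalize the age into 20-year buckets.
released = [
    (suppress(first), last, generalize_age(age, width=20))
    for first, last, age in records
]
print(released)  # [('*', 'Isvik', '20-39'), ('*', 'Isvik', '20-39')]
```

After both operations the two released rows are indistinguishable on these attributes, which is what lets them share one equivalence class.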

18
K-Anonymity Protection Model
Definition 1 (Quasi-identifier): A set of non-sensitive attributes {Q1, ..., Qw} of a table that can be linked with external data to uniquely identify at least one individual from the general population is known as a quasi-identifier.
Definition 2 (k-anonymity requirement): Each release of data must be such that every combination of values of quasi-identifiers can be indistinctly matched to at least k respondents.
Definition 3 (k-anonymity): A table T satisfies k-anonymity if for every tuple t in T there exist (k-1) other tuples $t_{i_1}, t_{i_2}, \ldots, t_{i_{k-1}}$ in T such that $t[C] = t_{i_1}[C] = t_{i_2}[C] = \cdots = t_{i_{k-1}}[C]$ for all C in QI, where QI is the quasi-identifier.
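
Definition 3 translates directly into a check: group the tuples by their quasi-identifier values and require every group to have at least k members. The sketch below assumes rows are stored as dictionaries; the function name `is_k_anonymous` and the small table are hypothetical and serve only to illustrate the definition.

```python
from collections import Counter

def is_k_anonymous(rows, quasi_identifiers, k):
    """Return True if every combination of quasi-identifier values
    that occurs in `rows` occurs at least k times."""
    counts = Counter(tuple(row[c] for c in quasi_identifiers) for row in rows)
    return all(n >= k for n in counts.values())

# Hypothetical released table, already generalized.
table = [
    {"age": "20-39", "zipcode": "537**", "disease": "Flu"},
    {"age": "20-39", "zipcode": "537**", "disease": "Hepatitis"},
    {"age": "40-59", "zipcode": "537**", "disease": "Bronchitis"},
    {"age": "40-59", "zipcode": "537**", "disease": "Flu"},
]

print(is_k_anonymous(table, ["age", "zipcode"], 2))  # True
print(is_k_anonymous(table, ["age", "zipcode"], 3))  # False
```

The sensitive attribute (here `disease`) is deliberately left out of the grouping key, since k-anonymity constrains only the quasi-identifier.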
19
Metrics used for the Algorithm
m numeric quasi-identifiers $N_1, N_2, \ldots, N_m$ and q categorical quasi-identifiers $C_1, C_2, \ldots, C_q$.
Information loss of a cluster $P_i$: $L(P_i) = |P_i| \cdot D(P_i)$
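
The transcript does not preserve the definition of D, so the following Python sketch rests on an assumption: D is taken to be the sum, over the numeric quasi-identifiers only, of the cluster's value range divided by the whole table's value range. The categorical term is omitted, and the function names are hypothetical. With that choice, the information loss of a cluster is its size times D.

```python
def distortion(cluster, table, numeric_attrs):
    """Assumed D(P): sum of range(P) / range(T) over numeric attributes."""
    total = 0.0
    for a in numeric_attrs:
        table_range = max(r[a] for r in table) - min(r[a] for r in table)
        if table_range == 0:
            continue  # attribute is constant in the table, nothing is lost
        cluster_range = max(r[a] for r in cluster) - min(r[a] for r in cluster)
        total += cluster_range / table_range
    return total

def information_loss(cluster, table, numeric_attrs):
    """L(P) = |P| * D(P) for a single cluster."""
    return len(cluster) * distortion(cluster, table, numeric_attrs)

def total_information_loss(clusters, numeric_attrs):
    """Sum of L(P) over all clusters of the partition."""
    table = [r for c in clusters for r in c]
    return sum(information_loss(c, table, numeric_attrs) for c in clusters)

# Hypothetical partition of four records on a single numeric attribute.
clusters = [
    [{"age": 20}, {"age": 30}],
    [{"age": 40}, {"age": 60}],
]
print(total_information_loss(clusters, ["age"]))  # 1.5
```

Here the table range is 40, so the first cluster contributes 2 * (10/40) and the second 2 * (20/40); tighter clusters generalize less and therefore lose less.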
20
Pseudocode of the MOKA Algorithm
// Clustering stage
Sort all the records in the table using the non-sensitive attributes.
Set the number of clusters K = number of records in the table / k value of k-anonymity.
Remove every kth record of the table and set it as the starting record of a cluster.
Find and assign the rest of the records of the table to their nearest cluster.
// Adjusting stage
Find the clusters G that have more than k records.
Sort the records of each G.
Remove all the records of G beyond the kth position and place them in R.
Find and assign every record of R to the closest cluster of size less than k.
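
The two stages can be sketched in Python as follows. This is a rough illustration, not the presenter's implementation, and it fills several gaps with assumptions: records are tuples of numbers, distance is Euclidean distance to a cluster's starting record, oversized clusters are sorted by that distance before trimming, and a leftover record goes to the nearest cluster overall once no cluster is smaller than k. The function name `moka_sketch` is hypothetical.

```python
import math

def moka_sketch(records, k):
    """Partition numeric records into clusters of at least k records each."""
    if k < 1 or len(records) < k:
        raise ValueError("need k >= 1 and at least k records")
    ordered = sorted(records)
    num_clusters = len(ordered) // k

    # Clustering stage: every kth record starts a cluster.
    seed_idx = set(range(k - 1, num_clusters * k, k))
    seeds = [ordered[i] for i in sorted(seed_idx)]
    clusters = [[s] for s in seeds]

    def nearest(rec, candidates):
        # Index of the candidate cluster whose seed is closest to rec.
        return min(candidates, key=lambda c: math.dist(rec, seeds[c]))

    for i, rec in enumerate(ordered):
        if i not in seed_idx:
            clusters[nearest(rec, range(num_clusters))].append(rec)

    # Adjusting stage: trim clusters larger than k, collect the surplus.
    leftover = []
    for c, members in enumerate(clusters):
        if len(members) > k:
            members.sort(key=lambda r: math.dist(r, seeds[c]))
            leftover.extend(members[k:])
            del members[k:]

    # Reassign surplus records, preferring clusters still smaller than k.
    for rec in leftover:
        small = [c for c in range(num_clusters) if len(clusters[c]) < k]
        clusters[nearest(rec, small or range(num_clusters))].append(rec)
    return clusters

recs = [(1,), (2,), (3,), (10,), (11,), (12,), (13,)]
for cluster in moka_sketch(recs, 3):
    print(sorted(cluster))
```

Because the surplus set always holds at least as many records as the undersized clusters are missing, every cluster ends with at least k records, which is what the later generalization step needs for k-anonymity.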
21
(MOKA) algorithm
22
Experimental results
23
Future work

l-diversity
Homogeneity Attack
Background knowledge attack
t-closeness
Skewness attack

24
Conclusion

The k-anonymity protection model can prevent identity disclosure, but a lack of diversity in the values of the sensitive attribute breaks the protection mechanism.
Clustering similar data together before anonymization can lower the information loss due to generalization.
In this research we propose a modified clustering method for k-anonymization.
We compared our algorithm with the k-means algorithm and obtained lower information loss in some cases.
We plan to change some parameters of our algorithm and to compare its performance with other similar algorithms.

25
Questions?

Thank you!

26
References

L. Sweeney, k-anonymity: a model for protecting privacy, International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 2002.
Jun-Lin Lin, Meng-Cheng Wei, An efficient clustering method for k-anonymization, Proceedings of the 2008 International Workshop on Privacy and Anonymity in Information Society (PAIS 2008), ACM.
Jun-Lin Lin, Meng-Cheng Wei, Chih-Wen Li, Kuo-Chiang Hsieh, A hybrid method for k-anonymization, Proceedings of the 2008 IEEE Asia-Pacific Services Computing Conference (APSCC 2008), pp. 385-390.
Khaled El Emam, Fida Kamal Dankar, Protecting Privacy Using k-Anonymity, Journal of the American Medical Informatics Association, Volume 15, Number 5, September/October 2008.
L. Sweeney, Computational Disclosure Control: A Primer on Data Privacy Protection, PhD thesis, Massachusetts Institute of Technology, 2001.
Kristen LeFevre, David J. DeWitt, Raghu Ramakrishnan, Incognito: Efficient Full-Domain K-Anonymity, Proceedings of SIGMOD 2005, June 14-16, 2005, Baltimore, Maryland, USA, pp. 49-60.
Kristen LeFevre, David J. DeWitt, Raghu Ramakrishnan, Mondrian Multidimensional K-Anonymity, technical report, University of Wisconsin, Madison.
Robert Lemos, Researchers reverse Netflix anonymization, SecurityFocus, 2007-12-04 (http://www.privacyanalytics.ca/news/netflix.pdf).
S. Hettich and S. D. Bay, The UCI KDD Archive, 1999, http://kdd.ics.uci.edu
Ashwin Machanavajjhala, Johannes Gehrke, Daniel Kifer, Muthuramakrishnan Venkitasubramaniam, l-Diversity: Privacy Beyond k-Anonymity, IEEE International Conference on Data Engineering, 2006.
Ninghui Li, Tiancheng Li, Suresh Venkatasubramanian, t-Closeness: Privacy Beyond k-Anonymity and l-Diversity, Proceedings of the IEEE 23rd International Conference on Data Engineering (ICDE), 2007.