Privacy protection in published data using an efficient clustering method (PowerPoint presentation transcript)

Description: Qualifying exam presentation by Md. Manzoor Murshed.

Transcript and Presenter's Notes

Title: Privacy protection in published data using an efficient clustering method

1
Privacy protection in published data using an
efficient clustering method
ICT 2010 presentation
  • Presented by Md. Manzoor Murshed

2
Overview of the Presentation
  • Introduction
  • Re-identification of Data
  • k-anonymity Model
  • MOKA Algorithm
  • Experimental results
  • Future Work
  • Conclusion
  • Questions

3
An Abundance of Data
  • Supermarket scanners
  • Credit card transactions
  • Call center records
  • ATM machines
  • Web server logs
  • Customer web site trails
  • Podcasts
  • Blogs
  • Closed captions
  • Scientific experiments
  • Sensors, Cameras
  • Hospital visits
  • Social Networks
  • Facebook, Myspace
  • Twitter
  • Speech-to-text translation
  • Email
  • Educational institutions
  • Travel records

Print, film, optical, and magnetic storage media produced about 5 exabytes (EB) of new information in 2002, roughly doubling over the previous three years ("How Much Information? 2003", UC Berkeley).
4
Data Holders Publish Sensitive Information to Facilitate Research
  • Publish information that:
  • discloses as much statistical information as possible;
  • allows discovery of valid, novel, potentially useful, and ultimately understandable patterns in the data;
  • preserves the privacy of the individuals contributing the data.

5
Question?
  • How do you publicly release a database without compromising individual privacy?
  • The wrong approach:
  • Just leave out unique identifiers like name and SSN, and hope that this works.
  • Why doesn't it work?
  • The triple (DOB, gender, ZIP code) suffices to uniquely identify at least 87% of US citizens in publicly available databases (1990 U.S. Census summary data).
  • Moral: any real privacy guarantee must be proved and established mathematically.

6
Examples of Re-identification Attempts
Health-specific and general examples of re-identification:
  • AOL search data: Researchers were able to reveal sensitive details of participants' private lives, such as Social Security numbers, credit-card numbers, and addresses, from the anonymized AOL Internet search data, which contains health-related searches as well [8].
  • Chicago homicide database: A large percentage of individuals were easily re-identified by linking the Chicago homicide database with the Social Security Death Index.
  • Netflix movie recommendations: Several individuals were re-identified from the publicly available anonymized Netflix movie recommendation database by linking their anonymized movie ratings with ratings on a publicly available Internet movie rating web site [9].
  • Re-identification of medical records: The Massachusetts governor's sensitive medical records were re-identified by linking the anonymized data of the Group Insurance Commission, which purchases health insurance for state employees, with the voter list for Cambridge [1].
  • Southern Illinoisan vs. The Department of Public Health: Individuals in a neuroblastoma data set from the Illinois cancer registry were re-identified with very high accuracy [4].
  • Canadian Adverse Event Database: The unfortunate death of a 26-year-old student who had taken a particular drug was re-identified from the publicly released adverse drug reaction database of Health Canada [4].
7
AOL Data Release
  • AOL publicly released an anonymized list of 21 million web search queries.
  • User IDs were replaced by random numbers.

8
A Face Is Exposed for AOL Searcher No. 4417749
(New York Times, August 9, 2006)
  • No. 4417749 conducted hundreds of searches over a three-month period on topics ranging from "numb fingers" to "60 single men" to "dog that urinates on everything."
  • And search by search, click by click, the identity of AOL user No. 4417749 became easier to discern. There are queries for "landscapers in Lilburn, Ga," several people with the last name Arnold, and "homes sold in shadow lake subdivision gwinnett county georgia."
  • It did not take much investigating to follow that data trail to Thelma Arnold, a 62-year-old widow who lives in Lilburn, Ga., frequently researches her friends' medical ailments, and loves her three dogs. "Those are my searches," she said, after a reporter read part of the list to her.

9
Re-identification of AOL data release
Ms. Arnold says she loves online research, but the disclosure of her searches has left her disillusioned. In response, she plans to drop her AOL subscription. "We all have a right to privacy," she said. "Nobody should have found this all out."
Source: http://data.aolsearchlogs.com
10
Re-identification by linking
  • NAHDO reported that 37 states have legislative mandates to collect hospital-level data.
  • GIC is responsible for purchasing health insurance for state employees.

The medical data was considered anonymous, since identifying attributes were removed. Yet the Governor of Massachusetts was uniquely identified by the attributes {ZIP, Birth date, Sex}; hence, his private medical records were out in the open.
11
Re-identification by linking (Example)
Hospital Patient Data

  DOB      Gender  Zipcode  Disease
  1/21/76  Male    53715    Heart Disease
  4/13/86  Female  53715    Hepatitis
  2/28/76  Male    53703    Bronchitis
  1/21/76  Male    53703    Broken Arm
  4/13/86  Female  53706    Flu
  2/28/76  Female  53706    Hang Nail

Voter Registration Data

  Name   DOB      Gender  Zipcode
  Andre  1/21/76  Male    53715
  Beth   1/10/81  Female  55410
  Carol  10/1/44  Female  90210
  Dan    2/21/84  Male    02174
  Ellen  4/19/72  Female  02237

  • Joining the two tables on (DOB, Gender, Zipcode) reveals that Andre has heart disease! (See the sketch below.)

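The linkage attack in this example is just an equi-join on the quasi-identifier. Below is a minimal sketch using pandas, with the two tables above transcribed as data frames (the variable names are mine):

import pandas as pd

hospital = pd.DataFrame({
    "DOB": ["1/21/76", "4/13/86", "2/28/76", "1/21/76", "4/13/86", "2/28/76"],
    "Gender": ["Male", "Female", "Male", "Male", "Female", "Female"],
    "Zipcode": ["53715", "53715", "53703", "53703", "53706", "53706"],
    "Disease": ["Heart Disease", "Hepatitis", "Bronchitis",
                "Broken Arm", "Flu", "Hang Nail"],
})

voters = pd.DataFrame({
    "Name": ["Andre", "Beth", "Carol", "Dan", "Ellen"],
    "DOB": ["1/21/76", "1/10/81", "10/1/44", "2/21/84", "4/19/72"],
    "Gender": ["Male", "Female", "Female", "Male", "Female"],
    "Zipcode": ["53715", "55410", "90210", "02174", "02237"],
})

# Join on the quasi-identifier (DOB, Gender, Zipcode): names meet diseases.
linked = voters.merge(hospital, on=["DOB", "Gender", "Zipcode"])
print(linked)  # one row: Andre, 1/21/76, Male, 53715, Heart Disease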
12
Data Publishing and Data Privacy
  • Society is experiencing exponential growth in the number and variety of data collections containing person-specific information.
  • This collected information is valuable both in research and in business, and data sharing is common.
  • Publishing the data may put the respondents' privacy at risk.
  • Objective:
  • Maximize data utility while limiting disclosure risk to an acceptable level.

13
What is Privacy?
"The claim of individuals, groups, or institutions to determine for themselves when, how and to what extent information about them is communicated to others." (Westin, Privacy and Freedom, 1967)
But we need quantifiable notions of privacy:
"... nothing about an individual should be learnable from the database that cannot be learned without access to the database." (T. Dalenius, 1977)
14
Quality versus anonymity
15
Related Work
  • Statistical databases:
  • Add noise while maintaining some statistical invariant.
  • Disadvantage: this can destroy the integrity of the data.
  • Multi-level databases:
  • Data is stored at different security classifications, and users have different security clearances.
  • Restrict the release of lower-classified information to eliminate precise inference.
  • Disadvantages:
  • It is impossible to consider every possible attack.
  • Suppression can drastically reduce the quality of the data.

16
K-Anonymity
  • Sweeney proposed a formal protection model named k-anonymity.
  • What is k-anonymity?
  • A release is k-anonymous if the information for each person contained in it cannot be distinguished from that of at least k-1 other individuals whose information also appears in the release.
  • Example:
  • Suppose you try to identify a man in a release, but the only information you have is his birth date and gender. If at least k people in the release share that birth date and gender, the release is k-anonymous.

17
Example of suppression and generalization
The following database

  First   Last    Age  Race
  Lynn    Isvik   34   American
  John    Siblic  36   Hisp
  Erin    Isvik   25   Hisp
  John    Deer    40   Hisp
  Victor  Clark   50   Cauc
  Danial  Clark   40   Cauc

can be 2-anonymized as follows:

  First  Last   Age    Race
  *      Isvik  20-40  American
  John   *      36     Hisp
  *      Isvik  20-40  Hisp
  John   *      36     Hisp
  *      Clark  40-50  Cauc
  *      Clark  40-50  Cauc

  • Rows 1 and 3 are identical, rows 2 and 4 are identical, and rows 5 and 6 are identical.
  • Suppression can replace individual attributes with a *.
  • Generalization replaces individual attributes with a broader category (see the sketch below).

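As a small, hedged illustration of the two operations on one record (the helper names are mine, and the decade-wide age bucket is a simplification; the slide's table uses wider ranges such as 20-40):

def suppress(value):
    """Suppression: replace an identifying value with *."""
    return "*"

def generalize_age(age, width=10):
    """Generalization: map an exact age to a broader interval, e.g. 34 -> 30-40."""
    low = (age // width) * width
    return f"{low}-{low + width}"

record = {"First": "Lynn", "Last": "Isvik", "Age": 34, "Race": "American"}
anonymized = {
    "First": suppress(record["First"]),    # first name suppressed
    "Last": record["Last"],
    "Age": generalize_age(record["Age"]),  # age generalized
    "Race": record["Race"],
}
print(anonymized)  # {'First': '*', 'Last': 'Isvik', 'Age': '30-40', 'Race': 'American'}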
18
K-Anonymity Protection Model
Definition 1 (Quasi-identifier): A set of non-sensitive attributes {Q1, . . . , Qw} of a table that can be linked with external data to uniquely identify at least one individual from the general population is known as a quasi-identifier.
Definition 2 (k-anonymity requirement): Each release of data must be such that every combination of values of quasi-identifiers can be indistinctly matched to at least k respondents.
Definition 3 (k-anonymity): A table T satisfies k-anonymity if for every tuple t in T there exist (k-1) other tuples ti1, ti2, . . . , tik-1 in T such that t[C] = ti1[C] = ti2[C] = . . . = tik-1[C] for all C in QI.
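Definition 3 translates directly into a group-and-count check: partition the table by its quasi-identifier values and verify that every group has at least k members. A minimal Python sketch follows (the record layout and the generalized ZIP values are illustrative, not from the slides):

from collections import Counter

def is_k_anonymous(table, qi, k):
    """table: list of record dicts; qi: list of quasi-identifier attribute names."""
    # Count how often each combination of quasi-identifier values occurs.
    counts = Counter(tuple(row[attr] for attr in qi) for row in table)
    # k-anonymity holds iff every record shares its QI values with at
    # least k-1 others, i.e. every group has size >= k.
    return all(size >= k for size in counts.values())

rows = [
    {"DOB": "1/21/76", "Gender": "Male", "Zipcode": "537**"},
    {"DOB": "1/21/76", "Gender": "Male", "Zipcode": "537**"},
    {"DOB": "4/13/86", "Gender": "Female", "Zipcode": "537**"},
]
print(is_k_anonymous(rows, ["DOB", "Gender", "Zipcode"], k=2))  # False: the third row is unique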
19
Metrics used for the Algorithm
Assume m numeric quasi-identifiers N1, N2, . . . , Nm and q categorical quasi-identifiers C1, C2, . . . , Cq.
Information loss of a cluster Pi: L(Pi) = |Pi| · D(Pi), where |Pi| is the number of records in Pi and D(Pi) is the distortion incurred by generalizing its records to a common value.
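As a hedged sketch of this metric for the numeric quasi-identifiers only: D(Pi) is assumed here to sum, for each numeric attribute, the cluster's value range divided by the attribute's full range in the table, a common choice in clustering-based k-anonymization; the categorical term (based on generalization hierarchies) is omitted for brevity.

def information_loss(cluster, table, numeric_attrs):
    """L(P) = |P| * D(P), with a numeric-only distortion term."""
    def value_range(rows, attr):
        values = [row[attr] for row in rows]
        return max(values) - min(values)

    distortion = sum(
        value_range(cluster, attr) / value_range(table, attr)
        for attr in numeric_attrs
        if value_range(table, attr) > 0
    )
    return len(cluster) * distortion

# Example: a 2-record cluster spanning ages 34-36 in a table spanning 25-50.
table = [{"Age": 34}, {"Age": 36}, {"Age": 25}, {"Age": 40}, {"Age": 50}]
print(information_loss(table[:2], table, ["Age"]))  # 2 * (2 / 25) = 0.16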
20
Pseudocode of the MOKA Algorithm
// clustering stage
Sort all the records in the table using the non-sensitive attributes
Set the number of clusters K = (number of records in the table) / k, where k is the value of k-anonymity
Remove every k-th record of the table and set it as the starting record of one cluster
Find and assign the rest of the records of the table to their nearest cluster
// adjusting stage
Find the clusters G that have more than k records
Sort the records of G
Remove all the records of G beyond the k-th location and place them in R
Find and assign every record of R to the closest cluster of size less than k
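A minimal Python sketch of the two stages above; the distance function, the sort key, the exact seed positions, and the fallback used when no under-full cluster remains are assumptions, since the pseudocode leaves them unspecified:

def moka(records, k, dist, key):
    """records: list of records; dist(a, b): record distance; key: sort key."""
    recs = sorted(records, key=key)        # clustering stage: sort the table
    n_clusters = max(1, len(recs) // k)    # K = number of records / k
    seed_idx = {i * k for i in range(n_clusters)}
    # Every k-th record becomes the starting record of one cluster.
    clusters = [[recs[i]] for i in sorted(seed_idx)]
    # Assign each remaining record to its nearest cluster (seed distance).
    for i, rec in enumerate(recs):
        if i not in seed_idx:
            min(clusters, key=lambda c: dist(rec, c[0])).append(rec)
    # Adjusting stage: trim clusters larger than k, collecting the overflow R.
    overflow = []
    for c in clusters:
        if len(c) > k:
            c.sort(key=key)
            overflow.extend(c[k:])
            del c[k:]
    # Reassign each overflow record to the closest cluster of size < k
    # (falling back to the closest cluster overall if none is under-full).
    for rec in overflow:
        under = [c for c in clusters if len(c) < k] or clusters
        min(under, key=lambda c: dist(rec, c[0])).append(rec)
    return clusters

With n records this yields floor(n/k) clusters, each holding at least k records once the adjusting stage finishes, which the generalization step can then anonymize.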
21
The MOKA algorithm
22
Experimental results
23
Future work
  • l-diversity: addresses the homogeneity attack and the background-knowledge attack on k-anonymity.
  • t-closeness: addresses the skewness attack.

24
Conclusion
  • The k-anonymity protection model can prevent identity disclosure, but a lack of diversity in the sensitive attribute values breaks the protection mechanism.
  • Clustering similar records together before anonymization can lower the information loss due to generalization.
  • In this research we propose a modified clustering method for k-anonymization.
  • We compared our algorithm with the k-means algorithm and obtained lower information loss in some cases.
  • We plan to vary some parameters of our algorithm and evaluate its performance against other similar algorithms.

25
Questions?
  • Thank you!

26
References
  • L. Sweeney, "k-Anonymity: A Model for Protecting Privacy," International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 2002.
  • Jun-Lin Lin, Meng-Cheng Wei, "An Efficient Clustering Method for k-Anonymization," Proceedings of the 2008 International Workshop on Privacy and Anonymity in Information Society (PAIS 2008), ACM, 2008.
  • Jun-Lin Lin, Meng-Cheng Wei, Chih-Wen Li, Kuo-Chiang Hsieh, "A Hybrid Method for k-Anonymization," Proceedings of the 2008 IEEE Asia-Pacific Services Computing Conference (APSCC 2008), pp. 385-390.
  • Khaled El Emam, Fida Kamal Dankar, "Protecting Privacy Using k-Anonymity," Journal of the American Medical Informatics Association, vol. 15, no. 5, September/October 2008.
  • L. Sweeney, "Computational Disclosure Control: A Primer on Data Privacy Protection," PhD thesis, Massachusetts Institute of Technology, 2001.
  • Kristen LeFevre, David J. DeWitt, Raghu Ramakrishnan, "Incognito: Efficient Full-Domain k-Anonymity," Proceedings of SIGMOD 2005, June 14-16, 2005, Baltimore, Maryland, USA, pp. 49-60.
  • Kristen LeFevre, David J. DeWitt, Raghu Ramakrishnan, "Mondrian Multidimensional k-Anonymity," technical report, University of Wisconsin, Madison.
  • Robert Lemos, "Researchers Reverse Netflix Anonymization," SecurityFocus, 2007-12-04 (http://www.privacyanalytics.ca/news/netflix.pdf).
  • S. Hettich and S. D. Bay, The UCI KDD Archive, 1999, http://kdd.ics.uci.edu
  • Ashwin Machanavajjhala, Johannes Gehrke, Daniel Kifer, Muthuramakrishnan Venkitasubramaniam, "l-Diversity: Privacy Beyond k-Anonymity," IEEE International Conference on Data Engineering, 2006.
  • Ninghui Li, Tiancheng Li, Suresh Venkatasubramanian, "t-Closeness: Privacy Beyond k-Anonymity and l-Diversity," Proceedings of the 23rd IEEE International Conference on Data Engineering (ICDE), 2007.