1
Anonymizing Sequential Releases
Benjamin C. M. Fung, Simon Fraser University, bfung@cs.sfu.ca
Ke Wang, Simon Fraser University, wangk@cs.sfu.ca
ACM SIGKDD 2006
2
Motivation: Sequential Releases
  • Previous works address a single release only.
  • Data are typically released sequentially in
    multiple versions:
  • New information becomes available.
  • A tailored view for each data sharing purpose.
  • Separate releases for sensitive and identifying
    information.

3
T2: Previous Release
Pid  Job       Disease
1    Banker    Cancer
2    Banker    Cancer
3    Clerk     HIV
4    Driver    Cancer
5    Engineer  HIV

T1: Current Release
Pid  Name   Job       Class
1    Alice  Banker    c1
2    Alice  Banker    c1
3    Bob    Clerk     c2
4    Bob    Driver    c3
5    Cathy  Engineer  c4

The join on T1.Job = T2.Job
Pid  Name   Job       Disease  Class
1    Alice  Banker    Cancer   c1
2    Alice  Banker    Cancer   c1
3    Bob    Clerk     HIV      c2
4    Bob    Driver    Cancer   c3
5    Cathy  Engineer  HIV      c4
-    Alice  Banker    Cancer   c1
-    Alice  Banker    Cancer   c1

Do not want Name to be linked to Disease in the
join of the two releases.
4
Join sharpens identification: {Bob, HIV} has
group size 1.
5
Join weakens identification: {Alice, Cancer} has
group size 4.
A lossy join combats the join attack.
6
Join enables inferences across tables: Alice → Cancer
has 100% confidence.
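
The three observations about the join can be checked with a short sketch. It re-enters the two example tables by hand (Pid is dropped here since the join uses Job only); the names `join`, `pair_counts` and `confidence` are illustrative, not from the slides.

```python
from collections import Counter

# Example tables from the slides.
T2 = [("Banker", "Cancer"), ("Banker", "Cancer"), ("Clerk", "HIV"),
      ("Driver", "Cancer"), ("Engineer", "HIV")]              # (Job, Disease)
T1 = [("Alice", "Banker", "c1"), ("Alice", "Banker", "c1"),
      ("Bob", "Clerk", "c2"), ("Bob", "Driver", "c3"),
      ("Cathy", "Engineer", "c4")]                            # (Name, Job, Class)

# Equi-join on Job: each T1 record pairs with every T2 record sharing its Job.
join = [(name, job, disease, cls)
        for (name, job, cls) in T1
        for (job2, disease) in T2
        if job == job2]

pair_counts = Counter((name, disease) for name, _, disease, _ in join)
name_counts = Counter(name for name, _, _, _ in join)


def confidence(name, disease):
    """Fraction of join records with this Name that carry this Disease."""
    return pair_counts[(name, disease)] / name_counts[name]


print(len(join))                          # 7 rows: 5 true matches plus 2 spurious ones
print(pair_counts[("Bob", "HIV")])        # 1
print(pair_counts[("Alice", "Cancer")])   # 4
print(confidence("Alice", "Cancer"))      # 1.0
```

The two extra Banker rows are what makes the join lossy: they enlarge the {Alice, Cancer} group, yet every Alice row still carries Cancer, so the inference survives.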
7
Related Work
  • k-anonymity [SS98, FWY05, BA05, LDR05, WYC04,
    WLFW06]
  • Quasi-identifier (QID): e.g., Job, birth date,
    Zip.
  • The database is made anonymous to its local QID.
  • In sequential releases, the database must be made
    anonymous to a global QID spanning the join of
    all releases thus far.

Explicit ID (removed) | QID (anonymized to groups of size k) | Sensitive attributes
8
Related Work
  • l-diversity [MGK06]
  • Sensitive values are well-represented in each
    QID group (measured by entropy).
  • Confidence limiting [WFY05, WFY06]
  • qid → s, confidence < h,
    where qid is a QID group and s is a sensitive value.

9
Related Work
  • View releases
  • T1 and T2 are two views in one release; both can
    be modified before the release.
  • [MW04, DP05] measure the information disclosure of a
    view set wrt a secret view.
  • [YWJ05, KG06] detect privacy violations by a view
    set over a base table.
  • These works detect, but do not eliminate, violations.

10
Sequential Release
  • Sequential release
  • Current release: T1. Previous release: T2.
  • T1 was unknown when T2 was released.
  • T2 cannot be modified when T1 is released.
  • Solution 1: k-anonymize all attributes in T1.
    Drawback: excessive distortion.
  • Solution 2: generalize T1 based on T2.
    Drawback: monotonically distorts the later release.
  • Solution 3: anonymize a complete cohort of all
    potential releases at one time. Drawback: must
    predict all future releases.

11
Intuition of Our Approach
  • A lossy join hides the true join relationship to
    cripple a global QID.
  • Generalize T1 so that the join with T2 becomes
    lossy enough to disorient the attacker.
  • Two general privacy notions: (X,Y)-anonymity and
    (X,Y)-linkability, where X and Y are sets of
    attributes.

12
(X,Y)-Privacy
  • k-anonymity: # of distinct records for each QID
    group ≥ k.
  • (X,Y)-anonymity: # of distinct Y values for each
    X group ≥ k.
  • (X,Y)-linkability: the maximum confidence of
    having a Y value given having an X value is ≤ k.
  • Generalizes k-anonymity [SS98] and confidence
    limiting [WFY05, WFY06].
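
One way to write these notions down for a table T, following the wording of the bullets. The symbols a_Y(x), A_Y(X), l_y(x) and L_Y(X) are notation assumed here for illustration, not copied from the slides.

```latex
\begin{align*}
a_Y(x) &= \bigl|\{\, t[Y] : t \in T,\ t[X] = x \,\}\bigr|,
  & A_Y(X) &= \min_{x}\, a_Y(x), \\
l_y(x) &= \frac{\bigl|\{\, t \in T : t[X] = x,\ t[Y] = y \,\}\bigr|}
               {\bigl|\{\, t \in T : t[X] = x \,\}\bigr|},
  & L_Y(X) &= \max_{x,\,y}\, l_y(x).
\end{align*}
% (X,Y)-anonymity:   A_Y(X) \ge k,  with k a positive integer.
% (X,Y)-linkability: L_Y(X) \le k,  with 0 < k \le 1.
```

Under this reading, k-anonymity is the case where X is the QID and Y is a key of T, so counting distinct Y values is the same as counting records.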

13
Example: (X,Y)-Anonymity
Pid  Job     Zip  PoB     Test
1    Banker  123  Canada  HIV
1    Banker  123  Canada  Diabetes
1    Banker  123  Canada  Eye
2    Clerk   456  Japan   HIV
2    Clerk   456  Japan   Diabetes
2    Clerk   456  Japan   Eye
2    Clerk   456  Japan   Heart
  • k-anonymity uses the # of records as anonymity, so it
    fails to ensure k distinct patients.

14
Example: (X,Y)-Anonymity
  • Anonymity wrt patients (instead of records):
  • X = {Job, Zip, PoB} and Y = {Pid}.
  • Each X group is linked to at least k distinct
    values on Pid.
  • Anonymity wrt tests:
  • X = {Job, Zip, PoB} and Y = {Test}.
  • Each X group is linked to at least k distinct
    tests.
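
A minimal sketch of the three counts on the seven-record example table (one Banker patient with three tests, one Clerk patient with four). The helper names `project` and `xy_anonymity` are made up for this illustration.

```python
from collections import defaultdict

COLUMNS = ("Pid", "Job", "Zip", "PoB", "Test")
records = [
    (1, "Banker", "123", "Canada", "HIV"),
    (1, "Banker", "123", "Canada", "Diabetes"),
    (1, "Banker", "123", "Canada", "Eye"),
    (2, "Clerk", "456", "Japan", "HIV"),
    (2, "Clerk", "456", "Japan", "Diabetes"),
    (2, "Clerk", "456", "Japan", "Eye"),
    (2, "Clerk", "456", "Japan", "Heart"),
]


def project(record, attrs):
    """Values of `record` on the attributes in `attrs`."""
    return tuple(record[COLUMNS.index(a)] for a in attrs)


def xy_anonymity(table, X, Y):
    """Smallest number of distinct Y values linked to any X group."""
    groups = defaultdict(set)
    for r in table:
        groups[project(r, X)].add(project(r, Y))
    return min(len(ys) for ys in groups.values())


X = ("Job", "Zip", "PoB")
# Y = all columns counts distinct records, which is what k-anonymity measures here.
by_record = xy_anonymity(records, X, COLUMNS)
by_patient = xy_anonymity(records, X, ("Pid",))
by_test = xy_anonymity(records, X, ("Test",))
print(by_record, by_patient, by_test)  # 3 1 3
```

The table looks 3-anonymous when records are counted, but each X group contains a single patient, so anonymity wrt Pid is only 1.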

15
Example: (X,Y)-Linkability
Pid  Job     Zip  PoB     Test
1    Banker  123  Canada  HIV
2    Banker  123  Canada  HIV
3    Banker  123  Canada  HIV
4    Banker  123  Canada  Diabetes
5    Clerk   456  Japan   Diabetes
6    Clerk   456  Japan   Diabetes
  • {Banker, 123, Canada} → HIV (75% confidence).
  • With Y = {Test}, (X,Y)-linkability states that no
    test can be inferred from an X group with
    confidence > a given threshold.
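
The 75% figure can be reproduced with a small sketch over the six example records. Restricting the check to a chosen set of sensitive values (`sensitive_values`) is an assumption of this sketch, as are the function names.

```python
from collections import Counter

# (X group, Test) pairs from the example table; Pid is not needed here.
rows = [
    (("Banker", "123", "Canada"), "HIV"),
    (("Banker", "123", "Canada"), "HIV"),
    (("Banker", "123", "Canada"), "HIV"),
    (("Banker", "123", "Canada"), "Diabetes"),
    (("Clerk", "456", "Japan"), "Diabetes"),
    (("Clerk", "456", "Japan"), "Diabetes"),
]

group_size = Counter(x for x, _ in rows)
pair_size = Counter(rows)


def confidence(x, y):
    """Fraction of records in X group `x` that have Y value `y`."""
    return pair_size[(x, y)] / group_size[x]


def max_confidence(sensitive_values):
    """Largest confidence x -> y over all X groups and the given Y values."""
    return max(confidence(x, y) for (x, y) in pair_size if y in sensitive_values)


def satisfies_linkability(sensitive_values, k):
    """True if no listed Y value can be inferred with confidence above k."""
    return max_confidence(sensitive_values) <= k


print(confidence(("Banker", "123", "Canada"), "HIV"))  # 0.75
print(satisfies_linkability({"HIV"}, 0.8))             # True
print(satisfies_linkability({"HIV"}, 0.7))             # False
```

If Diabetes is also treated as sensitive, the Clerk group drives the maximum confidence to 1.0, since both Clerk records carry Diabetes.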

16
Problem Statement
  • The data holder made previous release T2 and now
    makes current release T1, where T2 and T1 are
    projections of the same underlying table.
  • Want to ensure (X,Y)-privacy on the join of T1
    and T2, where X and Y are attribute sets on the
    join.
  • Sequential anonymization: generalize T1 on X ∩
    att(T1) so that the join satisfies (X,Y)-privacy
    and T1 remains as useful as possible.

17
Generalization / Specialization
  • Each generalization replaces all child values
    with the parent value.
  • A cut contains exactly one value on every
    root-to-leaf path.
  • Alternatively, each specialization replaces the
    parent value with a consistent child value in the
    record.
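
A minimal sketch of a taxonomy, a cut, and the two operations. The Job taxonomy below (ANY_Job over Office and Field) is hypothetical and only serves the illustration; the function names are likewise assumptions.

```python
# Hypothetical taxonomy for Job (child -> parent); not taken from the slides.
PARENT = {
    "Office": "ANY_Job", "Field": "ANY_Job",
    "Banker": "Office", "Clerk": "Office",
    "Driver": "Field", "Engineer": "Field",
}
LEAVES = ["Banker", "Clerk", "Driver", "Engineer"]


def path_to_root(value):
    """Values on the path from `value` up to the root, inclusive."""
    path = [value]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def is_cut(values):
    """A cut has exactly one value on every root-to-leaf path."""
    return all(sum(v in values for v in path_to_root(leaf)) == 1
               for leaf in LEAVES)


def generalize(leaf, cut):
    """Replace a leaf value by the cut value lying on its path."""
    return next(v for v in path_to_root(leaf) if v in cut)


def specialize(cut, parent):
    """Replace `parent` in the cut by its child values."""
    children = {c for c, p in PARENT.items() if p == parent}
    return (cut - {parent}) | children


cut = specialize({"ANY_Job"}, "ANY_Job")     # {"Office", "Field"}
print(is_cut(cut))                           # True
print(generalize("Clerk", cut))              # Office
print(is_cut({"Office", "Banker", "Field"})) # False: two values on Banker's path
```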

18
Match Function
  • The attacker applies prior knowledge to match the
    records in T1 and T2.
  • So, the data holder applies such prior knowledge
    in sequential anonymization.
  • We consider the following prior knowledge:
  • schema information of T1 and T2.
  • taxonomies for attributes.
  • the inclusion-exclusion principle.

19
Match Function
  • Let t1 ∈ T1 and t2 ∈ T2.
  • Inclusion Predicate: t1.A matches t2.A if they
    are on the same generalization path for attribute
    A.
  • e.g., Male matches Single Male.
  • Exclusion Predicate: t1.A matches t2.B only if
    they are not semantically inconsistent (based on
    common sense).
  • To exclude impossible matches.
  • e.g., Male and Pregnant are semantically
    inconsistent, and so are Married Male and 6 Month
    Pregnant.
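
The two predicates can be sketched as follows. The taxonomy fragment and the inconsistency list are hypothetical, built only from the values named in the examples, and letting an inconsistency between two general values carry over to their specializations is an assumption of this sketch.

```python
# Hypothetical taxonomy fragment (child -> parent) and inconsistency list.
PARENT = {
    "Male": "ANY_Sex", "Female": "ANY_Sex",
    "Single Male": "Male", "Married Male": "Male",
    "6 Month Pregnant": "Pregnant",
}
INCONSISTENT = {frozenset({"Male", "Pregnant"})}


def ancestors(value):
    """The value itself plus everything above it in the taxonomy."""
    chain = [value]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def inclusion_match(a, b):
    """True if a and b lie on the same generalization path."""
    return a in ancestors(b) or b in ancestors(a)


def exclusion_match(a, b):
    """True unless a (or a generalization of it) is listed as
    inconsistent with b (or a generalization of it)."""
    return not any(frozenset({x, y}) in INCONSISTENT
                   for x in ancestors(a) for y in ancestors(b))


print(inclusion_match("Male", "Single Male"))               # True
print(inclusion_match("Single Male", "Married Male"))       # False
print(exclusion_match("Married Male", "6 Month Pregnant"))  # False
```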

20
Algorithm Overview
  • Top-Down Specialization
  • Input: T1, T2, (X,Y)-privacy, a taxonomy tree for
    each attribute in X1 = X ∩ att(T1).
  • Output: a generalized T1 satisfying the privacy
    requirement.
  • generalize every value of Aj to ANYj, where Aj ∈
    X1
  • while there is a valid candidate in ∪Cutj do
  •   find the winner w of highest Score(w) from
      ∪Cutj
  •   specialize w on T1 and remove w from ∪Cutj
  •   update Score(v) and the valid status for all
      v in ∪Cutj
  • end while
  • output the generalized T1 and ∪Cutj
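
The shape of this loop can be sketched on a much simpler setting than the slides address. The sketch below is not the authors' algorithm: it works on a single table with one attribute (no join with T2), checks (X,Y)-anonymity with X = {Job} and Y = {Pid}, recomputes everything from scratch in each round, and uses a placeholder score (the anonymity left after the specialization) instead of Score(v). The taxonomy and records are hypothetical.

```python
from collections import defaultdict

# Hypothetical taxonomy (parent -> children) and toy records (Pid, Job).
TAXONOMY = {"ANY_Job": ["Office", "Field"],
            "Office": ["Banker", "Clerk"],
            "Field": ["Driver", "Engineer"]}
RECORDS = [(1, "Banker"), (2, "Banker"), (3, "Clerk"), (4, "Clerk"),
           (5, "Driver"), (6, "Engineer")]


def generalize(value, cut, parents):
    """Walk up the taxonomy until a value in the cut is reached."""
    while value not in cut:
        value = parents[value]
    return value


def anonymity(records, cut, parents):
    """Minimum number of distinct Pids per generalized Job group."""
    groups = defaultdict(set)
    for pid, job in records:
        groups[generalize(job, cut, parents)].add(pid)
    return min(len(pids) for pids in groups.values())


def top_down_specialize(records, tree, root, k):
    parents = {c: p for p, children in tree.items() for c in children}
    cut = {root}  # most general state: every value is the root
    while True:
        # Valid candidates: cut values with children whose specialization
        # keeps the anonymity at k or above.
        scored = []
        for v in cut:
            if v in tree:
                new_cut = (cut - {v}) | set(tree[v])
                a = anonymity(records, new_cut, parents)
                if a >= k:
                    scored.append((a, v, new_cut))
        if not scored:
            return cut
        # Placeholder score: keep the most anonymity (ties broken by name).
        _, _, cut = max(scored, key=lambda s: (s[0], s[1]))


print(sorted(top_down_specialize(RECORDS, TAXONOMY, "ANY_Job", 2)))
# ['Banker', 'Clerk', 'Field']
```

With k = 2, Field cannot be specialized because Driver and Engineer would each cover a single patient, so the loop stops with Field still in the cut.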

21
Anti-Monotone Privacy
  • Theorem 1: On a single table, (X,Y)-privacy is
    anti-monotone wrt specialization on X: if
    violated, it remains violated after a
    specialization.
  • On the join of T1 and T2, (X,Y)-privacy is not
    anti-monotone wrt specialization of T1.
  • Specializing T1 may create dangling records,
    e.g., by specializing CA into LA and San
    Francisco, LA records in T1 no longer match
    San Francisco records in T2.

22
Anti-Monotone Privacy
  • Theorem 2: Assume that T1 and T2 are projections
    of the same underlying table. Then (X,Y)-privacy on
    the join of T1 and T2 is anti-monotone wrt
    specialization of T1 on X ∩ att(T1).

23
Score Metric
  • Each specialization gains some information and
    loses some privacy. We maximize the gain per loss.
  • InfoGain(v) is measured on T1.
  • PrivLoss(v) is measured on the join of T1 and T2.
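
"Gain per loss" can be written as a ratio. The form below follows the top-down specialization work cited in these slides [FWY05]; the +1 in the denominator is stated here as an assumption, its role being to keep the score defined when a specialization costs no privacy.

```latex
\mathrm{Score}(v) \;=\; \frac{\mathrm{InfoGain}(v)}{\mathrm{PrivLoss}(v) + 1}
```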

24
Challenges
  • Each specialization affects the matching of the join,
    Score(v), and privacy checking.
  • Rejoining T1 and T2 for each specialization is
    too expensive.
  • Materializing the join is impractical because a
    lossy join can be very large.
  • Our solution: incrementally maintain some count
    statistics without executing the join.
  • An extension of Top-Down Specialization
    [FWY05, WFY05].

25
Empirical Study
  • The Adult data set. 45222 records. Categorical
    attributes only.

26
  • Schema for T1 and T2
  • T1 contains the Class attribute Income level.

Department        Attribute            # of Leaves  # of Levels
Taxation (T1)     Education (E)        16           5
Taxation (T1)     Occupation (O)       14           3
Taxation (T1)     Work-class (W)       8            5
Common (T1, T2)   Marital-status (M)   7            4
Common (T1, T2)   Relationship (Ra)    6            3
Common (T1, T2)   Sex (S)              2            2
Immigration (T2)  Native-country (Nc)  40           5
Immigration (T2)  Race (Ra)            5            3
27
Empirical Study
  • Classification metric:
  • Classification error on the generalized testing
    set of T1.
  • Distortion metric [SS98]:
  • 1 unit of distortion for generalization of each
    value in each record.
  • Normalized by the number of records.
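
One possible reading of the distortion metric, sketched under the assumption that each step from a value to its parent costs one unit. The taxonomy and the records are hypothetical, and `steps_up` and `distortion` are names made up for this sketch.

```python
# Hypothetical taxonomy (child -> parent).
PARENT = {"Banker": "Office", "Clerk": "Office", "Office": "ANY_Job",
          "Driver": "Field", "Engineer": "Field", "Field": "ANY_Job"}


def steps_up(raw, generalized):
    """Number of parent steps from the raw value to its generalized value."""
    steps = 0
    while raw != generalized:
        raw = PARENT[raw]  # KeyError if `generalized` is not an ancestor of `raw`
        steps += 1
    return steps


def distortion(raw_records, generalized_records):
    """Total generalization steps over all values, divided by the record count."""
    total = sum(steps_up(r, g)
                for raw, gen in zip(raw_records, generalized_records)
                for r, g in zip(raw, gen))
    return total / len(raw_records)


raw = [("Banker",), ("Clerk",), ("Driver",), ("Engineer",)]
gen = [("Office",), ("Office",), ("ANY_Job",), ("Engineer",)]
print(distortion(raw, gen))  # (1 + 1 + 2 + 0) / 4 = 1.0
```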

28
(X,Y)-Anonymity
  • TopN: the N attributes most important for
    classification.
  • Join attributes are the Top3 attributes.
  • X contains:
  • TopN attributes in T1 (to ensure that the
    generalization is performed on important
    attributes),
  • all join attributes,
  • all attributes in T2 (to ensure X is global).

29
  • Distortion of (X,Y)-anonymity
  • Ki denotes the key in Ti.
  • XYD: our method with Y = K1.
  • KAD: k-anonymization on QID = att(T1).

30
  • Classification error of (X,Y)-anonymity
  • XYE: our method with Y = K1.
  • XYE(row): our method with Y = {K1, K2}.
  • BLE: the unmodified data.
  • KAE: k-anonymization on QID = att(T1).
  • RJE: removing all join attributes from T1.

31
(X,Y)-Linkability
  • Y contains TopN attributes.
  • If not important, simply remove them.
  • X contains the rest of the attributes in T1 and
    T2.
  • Focus on classification error because no previous
    work studies distortion for (X,Y)-linkability.

32
  • Classification error of (X,Y)-linkability
  • XYE: our method with Y = TopN.
  • BLE: the unmodified data.
  • RJE: removing all join attributes from T1.
  • RSE: removing all attributes in Y from T1.

33
Scalability
(X,Y)-anonymity (k = 40)
(X,Y)-linkability (k = 90%)
34
Conclusion
  • Previous k-anonymization focused on a single
    release of data.
  • Studied the sequential anonymization problem when
    data are released sequentially and a global QID
    may span several releases.
  • Introduced lossy join to hide the join
    relationship and weaken the global QID.
  • Addressed challenges due to large size of lossy
    join.
  • Extendable to more than two releases T2, ..., Tp.

35
References
  • [BA05] R. Bayardo and R. Agrawal. Data privacy
    through optimal k-anonymization. In IEEE ICDE,
    pages 217-228, 2005.
  • [DP05] A. Deutsch and Y. Papakonstantinou.
    Privacy in database publishing. In ICDT, 2005.
  • [FWY05] B. C. M. Fung, K. Wang, and P. S. Yu.
    Top-down specialization for information and
    privacy preservation. In IEEE ICDE, pages
    205-216, April 2005.
  • [KG06] D. Kifer and J. Gehrke. Injecting utility
    into anonymized datasets. In ACM SIGMOD, Chicago,
    IL, June 2006.

36
References
  • [LDR05] K. LeFevre, D. J. DeWitt, and R.
    Ramakrishnan. Incognito: Efficient full-domain
    k-anonymity. In ACM SIGMOD, 2005.
  • [MGK06] A. Machanavajjhala, J. Gehrke, and D.
    Kifer. l-diversity: Privacy beyond k-anonymity.
    In IEEE ICDE, 2006.
  • [MW04] A. Meyerson and R. Williams. On the
    complexity of optimal k-anonymity. In PODS, 2004.
  • [SS98] P. Samarati and L. Sweeney. Protecting
    privacy when disclosing information: k-anonymity
    and its enforcement through generalization and
    suppression. In IEEE Symposium on Research in
    Security and Privacy, May 1998.

37
References
  • [WFY05] K. Wang, B. C. M. Fung, and P. S. Yu.
    Template-based privacy preservation in
    classification problems. In IEEE ICDM, pages
    466-473, November 2005.
  • [WFY06] K. Wang, B. C. M. Fung, and P. S. Yu.
    Handicapping attacker's confidence: An alternative
    to k-anonymization. Knowledge and Information
    Systems: An International Journal, 2006.
  • [WYC04] K. Wang, P. S. Yu, and S. Chakraborty.
    Bottom-up generalization: A data mining solution
    to privacy protection. In IEEE ICDM, November
    2004.

38
References
  • [WLFW06] R. C. W. Wong, J. Li, A. W. C. Fu, and
    K. Wang. (α,k)-anonymity: An enhanced k-anonymity
    model for privacy preserving data publishing. In
    ACM SIGKDD, 2006.
  • [YWJ05] C. Yao, X. S. Wang, and S. Jajodia.
    Checking for k-anonymity violation by views. In
    VLDB, 2005.