Security of Statistical Databases - PowerPoint PPT Presentation

1
Security of Statistical Databases
  • Mirka Miller
  • University of Ballarat
  • AUSTRALIA

3
Talk Overview
  • Confidentiality and Privacy
  • Statistical Databases
  • Statistical Compromise
  • Security Mechanisms
  • Noise Addition
  • Query Restriction
  • Audit Expert
  • Usability
  • Application of Combinatorics of Finite Sets to
    solve a security problem
  • Current Research

4
Confidentiality and Privacy
  • Confidential personal data is kept in
  • Census databases
  • Medical databases
  • Employee databases
  • Taxation office records
  • Criminal records
  • Bank balances and credit records
  • Phone calls
  • Shopping habits
  • Driving records
  • and many many others
  • Confidentiality of personal data is a major issue

5
Confidentiality and Privacy
  • This data is used for statistical analysis and
    knowledge discovery and data mining, and
    facilitates research in many areas, including
  • Marketing
  • Medicine
  • Crime investigation
  • Examples of databases
  • Census
  • Medical

6
Statistical Databases - Census
  • Census in Australia goes back more than 200 years
  • Until the mid 1880s: musters for convicts, blue books
    for free settlers
  • By 1850 it became a stigma to be the descendant
    of a convict
  • Usually there were about 5 copies of a census
    kept well locked up
  • Not enough protection: early census books were
    ordered destroyed over privacy concerns
  • Oldest surviving census in Australia is the 1828
    NSW census

7
Statistical Databases - Census (cont.)
  • Summary tables based on the census were published
    about a year later
  • Census in Australia now
  • Data is kept on a computer
  • Release of summary tables takes several years
  • No on-line access
  • Much more detailed information: privacy concerns

8
Statistical Databases - Medical
  • Another common concern regarding privacy is
    medical records
  • Traditionally:
  • confidentiality was assured by the doctor's
    Hippocratic oath and the privacy of the household
  • The situation now:
  • patient records are usually kept in a
    computerised database: access issues
  • Statistical access: de-personalising (removing
    patient names before performing statistics) is not
    enough; de-identified does not always mean
    unidentifiable

9
Statistical Databases
  • What is sensitive data?
  • A hundred years ago in Australia: convict ancestry
  • Now: more likely to be the amount of money a
    person makes or having an illness such as AIDS or
    mental illness
  • Demand for privacy of personal data is nothing
    new
  • What is new: our use of computers gives us a
    great capacity to gather data, process data,
    communicate data, and generate statistics based on
    the data

10
Security Problem of SDB
  • A statistical database is an ordinary database
    that returns only statistical information to user
    queries, based on groups of records
  • SDB is compromised if it is possible to derive
    confidential individual information from
    statistics.
  • The security problem of statistical databases is
    to control the use of a statistical database so
    that only statistical information about groups of
    records is available and no sequence of queries
    is sufficient to infer protected information
    about any individual

11
Statistical Databases
  • Various sizes
  • A few hundred records or fewer, e.g., medical
    data collected in an experiment
  • Hundreds of thousands of records or more, e.g.,
    census data, large medical databases
  • General purpose or for statistics only
  • On-line or off-line
  • Static or dynamic

12
Statistical Database Example
Table 1. Employee
13
Database Compromise Example
  • If anybody knows that Joe is the only male staff
    member in the Operations section then it is possible
    to disclose his salary using a SUM query:
  • select SUM(Salary)
  • from Employee
  • where Section = 'Operations' AND Gender = 'M'
  • Answer: 37500

14
Database Compromise Example
  • If a statistical query that selects only one
    record is not allowed then we can use two
    queries:
  • select SUM(Salary)
  • from Employee
  • where Section = 'Operations' AND Rank = 'Officer' AND
    Gender = 'F'
  • Answer: 89500; rows satisfying the query are R3
    and R5
  • select SUM(Salary)
  • from Employee
  • where Section = 'Operations' AND Rank = 'Officer'
  • Answer: 127000; rows satisfying the query are
    R3, R5, R10
  • Then it can be deduced that Joe's salary is
    127000 - 89500 = 37500
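The two-query attack above can be replayed in a few lines of Python. This is only a sketch: Table 1 did not survive in the transcript, so the rows below are a hypothetical stand-in. Only the totals 89500 and 127000, Joe's 37500 and the row labels R3, R5, R10 come from the slides; the split between R3 and R5, row R1, the column names and the minimum query-set size of 2 are assumptions.

```python
# Hypothetical stand-in for Table 1. Only R3 + R5 = 89500 and R10 = 37500
# are taken from the slides; every other value here is made up.
EMPLOYEES = [
    {"id": "R1", "section": "Sales", "rank": "Clerk", "gender": "F", "salary": 30000},
    {"id": "R3", "section": "Operations", "rank": "Officer", "gender": "F", "salary": 42000},
    {"id": "R5", "section": "Operations", "rank": "Officer", "gender": "F", "salary": 47500},
    {"id": "R10", "section": "Operations", "rank": "Officer", "gender": "M", "salary": 37500},
]

MIN_QUERY_SET_SIZE = 2  # assumed restriction: refuse queries matching fewer rows


def sum_salary(**conditions):
    """Answer SUM(Salary) over matching rows, or None if the query is refused."""
    rows = [e for e in EMPLOYEES
            if all(e[key] == value for key, value in conditions.items())]
    if len(rows) < MIN_QUERY_SET_SIZE:
        return None  # query set too small: no answer given
    return sum(e["salary"] for e in rows)


# The direct query selects only Joe, so it is refused.
direct = sum_salary(section="Operations", gender="M")

# Two permitted queries whose query sets differ by exactly one record.
female_officers = sum_salary(section="Operations", rank="Officer", gender="F")
all_officers = sum_salary(section="Operations", rank="Officer")

joe_salary = all_officers - female_officers
print(direct, female_officers, all_officers, joe_salary)
```

Both permitted queries pass the size check, yet their difference isolates a single record, which is why a size restriction alone does not prevent this compromise.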

15
Types of Compromise
  • Positive compromise occurs whenever it is found
    that an individual belongs to a particular
    category or holds a particular data value
  • Negative compromise occurs when it is determined
    that the individual does not belong to the
    category or does not possess a particular data
    value
  • If everything in a database can be deduced, the
    database is completely compromised
  • If information concerning at least one
    individual can be determined, the database is
    partially compromised

16
Types of Compromise
  • If the restricted information about a person is
    disclosed, it is considered to be an individual
    compromise; on the other hand, disclosed
    information about a group of individuals is
    called a group compromise
  • A subset S (|S| > 1) of records in a database is
    relatively compromised if the relative order of
    magnitude of the individuals in S is known

17
Data Inference and Compromise
  • All statistics contain vestiges of information
    about the individual records on which they are based
  • It is often possible to compare several
    statistics and derive some additional
    information about an individual; this is called
    data inference
  • If the data inference concerns supposedly
    protected (confidential) information then the
    statistical database has been compromised
  • We are mostly interested in databases used
  • For day-to-day processing of individual records,
    as well as
  • To obtain statistics about various subpopulations
    of the database

18
Security Problem of SDB
  • Many information security problems can be solved
    by the use of
  • proper access methods
  • encryption
  • but the problem of statistical compromise cannot
    be so solved since it concerns an authorised user
    performing authorised actions and the statistics
    are needed in the clear

19
Control Mechanisms
  • Most data inference techniques rely on set theory
    and/or linear algebra
  • Many techniques for the prevention of data
    compromise have been proposed
  • Two basic categories
  • Noise addition
  • Query restriction

20
Noise Addition
  • Noise addition techniques include
  • Data perturbation
  • Response perturbation
  • Data swapping
  • Random queries
  • All introduce some error to the answers of
    statistical queries
  • Problem is, how much error should be used?
  • If too little then can get approximate compromise
  • If too much then the statistics are useless
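The trade-off can be illustrated with a small response-perturbation sketch. The talk does not specify a noise distribution, so uniform noise, the bound of 500 and the function name `perturbed` are all assumptions; the two true totals are the ones from the compromise example.

```python
import random


def perturbed(true_answer, max_error, rng):
    """Response perturbation: true answer plus uniform noise in [-max_error, max_error]."""
    return true_answer + rng.uniform(-max_error, max_error)


rng = random.Random(0)   # fixed seed so the run is repeatable
MAX_ERROR = 500          # assumed noise bound, not taken from the talk

noisy_all_officers = perturbed(127000, MAX_ERROR, rng)
noisy_female_officers = perturbed(89500, MAX_ERROR, rng)

# The difference attack still works, just less precisely: the estimate is
# guaranteed to lie within 2 * MAX_ERROR of the true salary, 37500.
estimate = noisy_all_officers - noisy_female_officers
print(round(estimate))
```

With a small bound the attacker learns the salary to within 1000, an approximate compromise; widening the bound protects the individual but degrades every legitimate statistic by the same amount.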

21
Query Restriction
  • Query restriction techniques include
  • Query set-size restriction
  • Query overlap restriction
  • Maximum order queries
  • Audit Expert
  • Query restriction techniques either answer a
    query correctly (precisely) or not at all,
    possibly giving an error message

22
Audit Expert
  • The task of auditing may be delegated to the
    database system so that the database system
  • keeps track of the history of answered queries
    and changes in SDB
  • checks for possible compromise by every new query

23
Audit Expert Advantages
  • Absolute Security
  • By checking the past history of all the answered
    queries, auditing allows the SDB to answer a
    query only when it is secure to do so
  • Maximum Information
  • Given the previous query history of the SDB,
    auditing can provide the maximum information to
    the users. This includes accurate answers and as
    many query answers to the users as the security
    of the SDB permits

24
Security Problem of SDB - Usability
  • A query is answerable if answering it does not
    lead to a compromise.
  • SDB usability is the ratio of the number of
    answerable queries to the number of posable
    queries.
  • Statistical query types: SUM, COUNT, MAX, MIN,
    MEAN, etc.
  • The security problem of SDB is to find a control
    mechanism which would prevent compromise while
    maximising the usability.

25
Problem 1
  • Given: a collection of SUM queries
  • Compromise: exact
  • Question: Is the collection of queries
    compromise-free?

26
Audit Expert
  • In a database of n+1 records, a collection of k
    SUM queries can be thought of as a system of k
    linear equations
  • $m_{1,1}x_1 + m_{1,2}x_2 + \dots + m_{1,n+1}x_{n+1} = q_1$
  • $m_{2,1}x_1 + m_{2,2}x_2 + \dots + m_{2,n+1}x_{n+1} = q_2$
  • $\vdots$
  • $m_{k,1}x_1 + m_{k,2}x_2 + \dots + m_{k,n+1}x_{n+1} = q_k$

27
Audit Expert
  • In matrix form, $M_k X = Q$
  • $M_k$ is called the query matrix: a $k \times (n+1)$
    matrix whose row $i$ holds the coefficients of the
    $i$-th query
  • In order to avoid compromise we can have at most
    $n$ linearly independent queries.

28
Audit Expert
Table 1. Employee
Salary of all Males: query set (1 0 0 1 0)
Salary of all Females: query set (0 1 1 0 1)
Salary of all Officers: query set (1 0 1 0 1)
Salary of all under 25: query set (0 0 0 1 1)
29
Audit Expert
  • Salary of all Males: query set (1 0 0 1 0)
  • Salary of all Females: query set (0 1 1 0 1)
  • Salary of all Officers: query set (1 0 1 0 1)
  • Salary of all under 25: query set (0 0 0 1 1)
  • M5 ?

30
Audit Expert
  • Using Gaussian elimination we obtain
  • $M_k = (I_k \mid M'_k)$
  • where $I_k$ is the $k \times k$ identity matrix and $M'_k$ is a
    $k \times (n-k+1)$ matrix.
  • Answer to Problem 1:
  • Database is compromised if and only if there is a
    row in the normalised query matrix $M_k$ that
    contains exactly one non-zero element. (Chin and
    Ozsoyoglu, 1982)
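The test above can be sketched in Python with exact rational arithmetic. This is an illustrative sketch, not the Audit Expert implementation: it computes an ordinary reduced row echelon form without reordering columns, which leaves the "exactly one non-zero element" test unchanged, and the function names `rref` and `is_compromised` are made up here. The first query set is the one listed on the Audit Expert slides; the second encodes the two-query attack on rows R3, R5, R10.

```python
from fractions import Fraction


def rref(rows):
    """Reduced row echelon form over the rationals; all-zero rows are dropped."""
    m = [[Fraction(v) for v in row] for row in rows]
    pivot_row = 0
    n_cols = len(m[0]) if m else 0
    for col in range(n_cols):
        if pivot_row == len(m):
            break
        # Find a row at or below pivot_row with a non-zero entry in this column.
        found = next((r for r in range(pivot_row, len(m)) if m[r][col] != 0), None)
        if found is None:
            continue
        m[pivot_row], m[found] = m[found], m[pivot_row]
        pivot = m[pivot_row][col]
        m[pivot_row] = [v / pivot for v in m[pivot_row]]
        # Clear this column in every other row.
        for r in range(len(m)):
            if r != pivot_row and m[r][col] != 0:
                factor = m[r][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[pivot_row])]
        pivot_row += 1
    return [row for row in m if any(v != 0 for v in row)]


def is_compromised(queries):
    """True if some normalised row has exactly one non-zero element."""
    return any(sum(1 for v in row if v != 0) == 1 for row in rref(queries))


slide_queries = [
    (1, 0, 0, 1, 0),  # salary of all Males
    (0, 1, 1, 0, 1),  # salary of all Females
    (1, 0, 1, 0, 1),  # salary of all Officers
    (0, 0, 0, 1, 1),  # salary of all under 25
]
difference_attack = [
    (1, 1, 0),  # R3 + R5
    (1, 1, 1),  # R3 + R5 + R10
]
print(is_compromised(slide_queries), is_compromised(difference_attack))
```

The four slide queries are linearly independent but every normalised row keeps two non-zero entries, so no single salary is exposed; the difference attack reduces to a row (0 0 1), which pins down R10.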

31
Problem 2
  • Given: a database of n+1 records
  • Posable queries: all SUM queries
  • Compromise: exact
  • Question: What is the maximum number of queries
    that can be answered without compromise?

32
Mathematical Formulation
  • In a multiset $A = \{a_1, a_2, \dots, a_n\}$, $a_i \in \mathbb{R}$,
    $a_i \neq 0$, what is the maximum number of partial
    sums that are equal to 0 or 1?
  • What can the numbers $a_1, a_2, \dots, a_n$ be to
    achieve the maximum?
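For small multisets the count can simply be enumerated. The sketch below is a brute-force illustration, not part of the talk; the function name `count_answerable` is made up, and counting the empty subset (sum 0) is an assumption about the convention used.

```python
from itertools import combinations


def count_answerable(values):
    """Count subsets (including the empty one) whose element sum is 0 or 1."""
    if any(v == 0 for v in values):
        raise ValueError("all values must be non-zero")
    count = 0
    for size in range(len(values) + 1):
        # combinations() works on positions, so repeated values are kept apart.
        for subset in combinations(values, size):
            if sum(subset) in (0, 1):
                count += 1
    return count


print(count_answerable([-2, -1, 1, 1]))      # the values a, b, c, d used later in the talk
print(count_answerable([-1, -1, 1, 1]))      # half -1s, half 1s
print(count_answerable([-1, -1, 1, 1, 1]))   # five values
```

With five values (two -1s and three 1s) the count is 20, which agrees with the figure quoted for the high-usability example if n = 6 there counts records, that is n + 1 in this formulation; that reading is an assumption.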

33
Mathematical Formulation (cont.)
  • Where does this formulation come from?
  • Mn

34
Example of Low Usability
  • Example: n = 6, maximum number of answerable
    queries is 6.
  • M6

35
Example of High Usability
  • Example: n = 6, maximum number of answerable
    queries is 20.
  • M6

36
Usability of Secure SDB
  • Answer to Problem 2
  • Theorem (Miller, Roberts, Simpson, 1991)
  • The maximum usability of a statistical database
    of n records in which all queries are posable,
    for exact compromise, is

37
Usability of Secure SDB
  • Proof makes use of a symmetric chain
    decomposition of an n-set and the following
    theorem.
  • Theorem (Anderson, 1987)
  • The subsets of an n-set can be partitioned into
    $\binom{n}{\lfloor n/2 \rfloor}$ disjoint symmetric chains.
    The number of symmetric chains containing exactly
    $i$ subsets is equal to
    $\binom{n}{\lfloor (n+i)/2 \rfloor} - \binom{n}{\lfloor (n+i+1)/2 \rfloor}$
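As a quick check of the chain-count formula, read here with floor brackets and plus signs (an assumption about symbols lost in the transcript), take n = 4. The counts agree with the decomposition of the 4-set shown in the talk: one chain of five subsets, three of three subsets and two singletons.

```latex
\begin{align*}
\text{total chains:}\quad & \binom{4}{\lfloor 4/2 \rfloor} = \binom{4}{2} = 6 \\
i = 5:\quad & \binom{4}{\lfloor 9/2 \rfloor} - \binom{4}{\lfloor 10/2 \rfloor}
            = \binom{4}{4} - \binom{4}{5} = 1 - 0 = 1 \\
i = 3:\quad & \binom{4}{\lfloor 7/2 \rfloor} - \binom{4}{\lfloor 8/2 \rfloor}
            = \binom{4}{3} - \binom{4}{4} = 4 - 1 = 3 \\
i = 1:\quad & \binom{4}{\lfloor 5/2 \rfloor} - \binom{4}{\lfloor 6/2 \rfloor}
            = \binom{4}{2} - \binom{4}{3} = 6 - 4 = 2 \\
\text{check:}\quad & 1 + 3 + 2 = 6 \text{ chains}, \qquad
            1 \cdot 5 + 3 \cdot 3 + 2 \cdot 1 = 16 = 2^4 \text{ subsets}
\end{align*}
```

For even i the two binomial coefficients coincide, so there are no chains of two or four subsets when n = 4.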

38
Symmetric Chain Decomposition
  • Symmetric chain decomposition of a 4-set
    S = {a,b,c,d}
  • Ø ⊂ {a} ⊂ {a,b} ⊂ {a,b,c} ⊂ {a,b,c,d}
  • {b} ⊂ {b,c} ⊂ {b,c,d}
  • {c} ⊂ {c,d} ⊂ {a,c,d}
  • {d} ⊂ {a,d} ⊂ {a,b,d}
  • {a,c}
  • {b,d}

39
Symmetric Chain Decomposition
  • Suppose a = -2, b = -1, c = 1, d = 1. Associate with each
    subset the sum of the values of its elements.
  • Ø ⊂ {a} ⊂ {a,b} ⊂ {a,b,c} ⊂ {a,b,c,d}
  • sums: 0, -2, -3, -2, -1
  • {b} ⊂ {b,c} ⊂ {b,c,d}
  • sums: -1, 0, 1
  • {c} ⊂ {c,d} ⊂ {a,c,d}
  • sums: 1, 2, 0
  • {d} ⊂ {a,d} ⊂ {a,b,d}
  • sums: 1, -1, -2
  • {a,c}: -1 and {b,d}: 0

40
Symmetric Chain Decomposition
  • a = -2, b = -1, c = 1, d = 1. Rearrange the order of the
    subsets so that the partial sums in an
    antichain are strictly increasing
  • Ø ⊂ {a} ⊂ {a,b} ⊂ {a,b,c} ⊂ {a,b,c,d}
  • {a,b}, {b}, Ø, {c}, {c,d}
  • sums: -3, -1, 0, 1, 2
  • {b} ⊂ {b,c} ⊂ {b,c,d}
  • {c} ⊂ {c,d} ⊂ {a,c,d}
  • {d} ⊂ {a,d} ⊂ {a,b,d}
  • {a,c}
  • {b,d}

41
Symmetric Chain Decomposition
  • The rest of the proof is a routine application
    of Anderson's Theorem, adding once the number of
    singleton antichains plus twice the number of
    non-singleton antichains.
  • {a,b}, {b}, Ø, {c}, {c,d}
  • {a}, {a,c}, {a,c,d}
  • {a,b,c}, {a,b,c,d}, {b,c,d}
  • {a,b,d}, {b,d}, {d}
  • {b,c}
  • {a,d}
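The counting step can be checked numerically for small n. In each rearranged group the sums are strictly increasing, so at most two of them can equal 0 or 1, and a singleton contributes at most one. The sketch below adds these up using the chain counts from Anderson's theorem (read with floor brackets and plus signs, an assumption about the garbled formula) and compares the result with a brute-force search over small integer values. The function names and the candidate values are made up for illustration.

```python
from itertools import combinations, product
from math import comb


def chain_bound(n):
    """Singleton chains counted once, longer chains counted twice."""
    total = 0
    for i in range(1, n + 2):
        chains = comb(n, (n + i) // 2) - comb(n, (n + i + 1) // 2)
        total += chains * (1 if i == 1 else 2)
    return total


def hits(values):
    """Number of subsets (empty one included) whose sum is 0 or 1."""
    return sum(1
               for size in range(len(values) + 1)
               for subset in combinations(values, size)
               if sum(subset) in (0, 1))


def brute_force_max(n, candidates=(-2, -1, 1, 2)):
    """Best count over all length-n tuples drawn from a few non-zero integers."""
    return max(hits(values) for values in product(candidates, repeat=n))


for n in (3, 4):
    print(n, chain_bound(n), brute_force_max(n))
```

For n = 4 the bound is 2 + 2 * 4 = 10, matching the six groups listed above (two singletons, four longer groups), and the search reaches it with two -1s and two 1s.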

42
Maximum Usability of Secure SDB
  • The maximum usability of a statistical database
    of n records in which all queries are posable,
    for exact compromise, is
  • It is achieved by having roughly half 1s and
    half -1s.

43
Our Research
  • Identifying a new type of compromise called a
    relative compromise. We gave techniques both for
    achieving it and for preventing it
    (with Jennie Seberry)
  • Finding general upper bounds for the number of
    statistical queries that can be safely answered
    (with Jamie Simpson, Ian Roberts, Ljiljana
    Brankovic)
  • For practical reasons we allow a given set of
    queries to be included before maximising (with
    Ljiljana Brankovic, Jozef Siran)

44
Our Research
  • Supplementary knowledge is always needed for the
    identification of a particular record. We have
    given a general framework of data inference and
    compromise incorporating supplementary knowledge
  • Lately we became interested in range queries
    (with Ljiljana Brankovic, Peter Horak)
  • Implementing an expert system that includes legal
    requirements (with Vivek Mishra, Andrew
    Stranieri, Joe Ryan)
  • Theoretical research continues (with Yuliya
    Lenard)

45
Our Research
  • The security problem of statistical databases is
  • An important practical problem
  • Should be of concern to any owner of a database
    that releases statistics based on confidential
    personal data
  • Is becoming more and more important as new
    privacy acts and legislation are introduced
  • As insurance companies get involved it will be
    necessary to demonstrate that security measures
    are used

46
Our Research
  • The security problem of statistical databases is
  • An ideal topic for a PhD thesis
  • An excellent vehicle for theoretical research,
    practical implementations, comparative study of
    security mechanisms
  • Links with other researchers in Australia and
    overseas
  • Information security courses are being introduced
    in many universities so chances of employment may
    be higher than in other areas

47
  • Thank you