Machine Learning and Data Mining Course Summary - PowerPoint PPT Presentation

Slides: 28
Provided by: gregoryp8

1
Machine Learning and Data Mining Course Summary
2
Outline
  • Data Mining and Society
  • Discrimination, Privacy, and Security
  • Hype Curve
  • Future Directions
  • Course Summary

3
Controversial Issues
  • Data mining (or simple analysis) on people may
    come with a profile that would raise
    controversial issues of
      • discrimination
      • privacy
      • security
  • Examples
      • Should males between 18 and 35 from countries
        that produced terrorists be singled out for
        search before flight?
      • Can people be denied a mortgage based on age,
        sex, race?
      • Women live longer. Should they pay less for life
        insurance?
  • Note that these issues arise because of looking
    for niche groups, not through data mining per se

4
Data Mining and Discrimination
  • Can discrimination be based on features like sex,
    age, national origin?
  • In some areas (e.g. mortgages, employment), some
    features cannot be used for decision making
  • In other areas, these features are needed to
    assess the risk factors
  • E.g. people of African descent are more
    susceptible to sickle cell anaemia

5
CRM and Finance
  • Customer Relationship Management (CRM): analyse
    customer data to find profitable customers
  • www.bankrate.com, 1999:
      • Customers identified as "losers" by CRM might
        get checking accounts: "you charge them higher
        fees because you don't want them; make them know
        they're not welcome" (First Manhattan Consulting
        Group)
      • Unprofitable customers will pay an additional
        price in terms of service: you answer the cash
        cows first. The losers can wait 20 minutes if
        they call in a question. The losers will just
        make you drown.
      • Raise his ATM, credit card and account fees
        until he leaves.

6
  • Banks want to make a profit off rich customers by
    cross-selling products
  • Bureaus produce household-specific demographic
    data. Consumer groups worry that's a device for
    identification of low-income neighbourhoods
    (compare with low-participation postcodes for
    university entrance)
  • Debit Bureau provides an Audit Report that
    notifies a bank teller's boss when an account is
    opened despite identification of the customer as
    high risk.
  • "Hey, I had friends at American Express who lost
    their jobs because they couldn't identify the
    profitable customers. I liked them. They were
    good people who cared about people. But in the
    global economy no one cares if you're a good
    institution. Well, maybe they would if you knew
    how to market goodness."

7
Data Mining and Privacy
  • Can information collected for one purpose be used
    for mining data for another purpose?
  • In Europe, generally no, without explicit consent
  • In US, generally yes
  • Companies routinely collect information about
    customers and use it for marketing, etc.
  • People may be willing to give up some of their
    privacy in exchange for some benefits
  • See Data Mining And Privacy Symposium,
    www.kdnuggets.com/gpspubs/ieee-expert-9504-priv.html

8
Data Mining with Privacy
  • Data Mining looks for patterns, not people!
  • Technical solutions can limit privacy invasion
      • Replacing sensitive personal data with an
        anonymous ID
      • Giving randomized outputs
          • return salary + random()
  • See Bayardo and Srikant, Technological Solutions
    for Protecting Privacy, IEEE Computer, Sep 2003
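A minimal sketch of the randomized-output idea, assuming zero-mean Gaussian noise (the slide does not say which distribution random() draws from). The function name randomized_salary, the noise_sd parameter and the salary figures are made up for illustration. Individual values are distorted, while aggregates such as the mean stay close to the truth.

```python
import random
import statistics

def randomized_salary(salary, noise_sd=5000.0, rng=random):
    """Return the true salary plus zero-mean Gaussian noise (assumed noise model)."""
    return salary + rng.gauss(0.0, noise_sd)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
true_salaries = [30_000 + 1_000 * i for i in range(200)]  # made-up data
released = [randomized_salary(s, rng=rng) for s in true_salaries]

# Each released value differs from the true one, but the means are close.
print(statistics.mean(true_salaries), statistics.mean(released))
```

The larger noise_sd is, the better individual salaries are hidden and the more records are needed before aggregate patterns become reliable again.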

9
Rule Sensitivity
  • Data mining may infer information that is private
    or ethically sensitive. The sensitivity may not
    be apparent to the data miner.
  • Since the data mining process is inductive, many
    rules may be stereotypical or may be misleading
    (because they don't generalise).
  • Privacy is the individual's desire to keep
    certain information about themselves hidden from
    others.
  • Ethics is a set of moral principles or values
    that guides behaviour.

10
Cause for Concern?
  • InfoWeek survey in 2001 found that over 20% of US
    companies store data on their customers including
    medical profile, demographics, salary and credit
    information, and over 15% store information about
    customers' legal history.
  • Yet data mining can be very misleading. A study
    in 1997 (Leinweber) found that the best indicator
    for the S&P 500 was the estimated level of butter
    production in Bangladesh.
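One way to see how such a misleading indicator can arise is to search many unrelated series for the one that best matches a target. The sketch below is not a reconstruction of the 1997 study: the target and all candidates are pure random noise, and the function names and sizes are hypothetical. It only illustrates that the best of many candidates can correlate strongly by chance.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def best_spurious_correlation(n_points=12, n_candidates=2000, seed=0):
    """Largest |correlation| between a random target and unrelated random series."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_points)]
    best = 0.0
    for _ in range(n_candidates):
        candidate = [rng.gauss(0, 1) for _ in range(n_points)]
        best = max(best, abs(pearson(target, candidate)))
    return best

# Every candidate is noise, yet the best one looks like a strong indicator.
print(best_spurious_correlation())
```

With few data points and many candidate indicators, a high correlation on its own is weak evidence; the pattern has to hold on data that was not used to find it.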
11
Privacy Preservation
  • Secure sharing of data between organisations:
    sharing for mutual benefit without compromising
    competitiveness
  • Confidentialisation of publicly available data:
    ensuring that individuals are not identifiable
    from aggregate data
  • Anonymisation of private data: modifying or
    randomising information
  • Access control: limiting who has access, and to
    what.

12
Privacy Preservation Methods
  • Anonymisation: removing identifiers
  • Noise addition: distort values (e.g. add Gaussian
    noise)
  • Data swapping: attribute values are interchanged
    between records so that the results of statistical
    queries are maintained
  • Merging: several values are combined into a
    coarser category
  • Sampling: only a small amount of data is released
  • Coding: values are encoded so they have no meaning
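Several of the methods listed above can be sketched on a toy table. The records, field names and helper functions here are all invented for illustration, and the swap is a plain shuffle of one column, which keeps that column's totals and means intact but is simpler than the swapping schemes used in practice.

```python
import random

# Made-up records for illustration only.
records = [
    {"name": "person_1", "age": 37, "salary": 41_000},
    {"name": "person_2", "age": 52, "salary": 58_000},
    {"name": "person_3", "age": 29, "salary": 33_000},
    {"name": "person_4", "age": 44, "salary": 47_000},
]

def remove_identifiers(rows, id_fields=("name",)):
    """Anonymisation: drop fields that directly identify a person."""
    return [{k: v for k, v in row.items() if k not in id_fields} for row in rows]

def add_noise(rows, field, sd, rng):
    """Noise addition: distort one numeric field with Gaussian noise."""
    return [{**row, field: row[field] + rng.gauss(0.0, sd)} for row in rows]

def swap_values(rows, field, rng):
    """Data swapping: shuffle one column across records (column statistics unchanged)."""
    values = [row[field] for row in rows]
    rng.shuffle(values)
    return [{**row, field: v} for row, v in zip(rows, values)]

def coarsen_age(age, width=10):
    """Merging: replace an exact age with a coarser band such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

rng = random.Random(1)
anonymous = remove_identifiers(records)
swapped = swap_values(anonymous, "salary", rng)
noisy = add_noise(anonymous, "salary", sd=1_000.0, rng=rng)
merged = [{**row, "age": coarsen_age(row["age"])} for row in anonymous]
print(merged)
```

Note that removing the name field alone may not prevent re-identification if the remaining attributes are distinctive, which is why the methods are often combined.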

13
Drawbacks
  • Data modification
      • degrades performance (e.g. creates spurious
        rules)
      • makes it harder to link multiple databases
        (because key is removed)
      • is often specific to the data mining algorithm
      • makes it harder to interpret results.

14
Data Mining and Security: Controversy in the News
  • TIA: Terrorism (formerly Total) Information
    Awareness Program
      • DARPA program closed by Congress, Sep 2003
      • some functions transferred to intelligence
        agencies
  • CAPPS II: screen all airline passengers
      • controversial
  • Invasion of Privacy or Defensive Shield?

15
Criticism of the Analytic Approach to Threat Detection
  • Data Mining will
      • invade privacy
      • generate millions of false positives
  • But can it be effective?

16
Is the criticism sound?
  • Criticism: Databases have 5% errors, so analyzing
    100 million suspects will generate 5 million
    false positives
  • Reality: Analytical models correlate many items
    of information to reduce false positives.
  • Example: Identify one biased coin from 1,000.
      • After one throw of each coin, we cannot
      • After 30 throws, one biased coin will stand out
        with high probability.
      • Can identify 19 biased coins out of 100 million
        with a sufficient number of throws
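The coin example can be checked with a short calculation. The sketch below assumes the most extreme bias, a coin that always lands heads, because the slide does not say how biased the coin is; with a weaker bias more throws are needed. Function names are illustrative. A fair coin "survives" if it also shows heads on every throw, so it is a false positive.

```python
import random

def expected_fair_survivors(n_coins, n_throws, n_biased=1):
    """Expected number of fair coins that show heads on every throw."""
    return (n_coins - n_biased) * 0.5 ** n_throws

def simulate_survivors(n_coins, n_throws, seed=0):
    """Count coins showing all heads; one coin is assumed to always land heads."""
    rng = random.Random(seed)
    survivors = 1  # the always-heads coin survives by assumption
    for _ in range(n_coins - 1):
        if all(rng.random() < 0.5 for _ in range(n_throws)):
            survivors += 1
    return survivors

for throws in (1, 10, 30):
    print(throws,
          expected_fair_survivors(1_000, throws),
          simulate_survivors(1_000, throws))

# 19 always-heads coins among 100 million: expected false positives after 27 throws.
print(expected_fair_survivors(100_000_000, 27, n_biased=19))
```

Each extra throw halves the expected number of false positives, which is the sense in which combining many independent pieces of evidence reduces them.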

17
Another Approach: Link Analysis
Can find unusual patterns in the network structure
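The slide does not say which structural patterns are meant. As one hypothetical illustration, the sketch below flags nodes whose number of links is far above the typical value in a made-up edge list; the function name, the factor threshold and the data are assumptions, not the method from the course.

```python
from collections import Counter
from statistics import median

def unusual_nodes(edges, factor=3.0):
    """Return nodes whose degree is at least `factor` times the median degree."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    typical = median(degree.values())
    return sorted(n for n, d in degree.items() if d >= factor * typical)

# Made-up network: one node linked to eight others, plus a few ordinary links.
edges = [("hub", f"n{i}") for i in range(8)]
edges += [("n0", "n1"), ("n2", "n3"), ("n4", "n5")]
print(unusual_nodes(edges))
```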
18
Analytic technology can be effective
  • Combining multiple models and link analysis can
    reduce false positives
  • Today there are millions of false positives with
    manual analysis
  • Data mining is just one additional tool to help
    analysts
  • Analytic technology has the potential to reduce
    the current high rate of false positives

19
Data Mining and Society
  • No easy answers to controversial questions
  • Society and policy-makers need to make an
    educated choice
  • Benefits and efficiency of data mining programs
    vs. cost and erosion of privacy

20
The Hype Curve for Data Mining and Knowledge Discovery
[Figure: hype curve showing rising expectations, then over-inflated expectations]
21
The Hype Curve for Data Mining and Knowledge Discovery
[Figure: hype curve showing rising expectations, over-inflated expectations, disappointment, then growing acceptance and mainstreaming]
22
Data Mining: Future Directions
  • Currently, most data mining is on flat tables
  • Richer data sources
      • text, links, web, images, multimedia, knowledge
        bases
  • Advanced methods
      • Link mining, Stream mining, ...
  • Applications
      • Web, Bioinformatics, Customer modeling, ...

23
Challenges for Data Mining
  • Technical
      • terabytes and petabytes
      • complex, multi-media, structured data
      • integration with domain knowledge
  • Business
      • finding good application areas
  • Societal
      • privacy and ethical issues

24
Data Mining: Central Quest
Find true patterns and avoid overfitting (false
patterns due to randomness)
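A small sketch of what a false pattern due to randomness looks like, assuming a one-feature 1-nearest-neighbour classifier and labels that are pure coin flips (both choices are for illustration only). The model reproduces its training data perfectly, yet on held-out data it does no better than guessing.

```python
import random

def nearest_label(x, train):
    """1-nearest-neighbour prediction on a single numeric feature."""
    return min(train, key=lambda pair: abs(pair[0] - x))[1]

def accuracy(data, train):
    """Fraction of (x, y) pairs in data whose label the model predicts correctly."""
    return sum(nearest_label(x, train) == y for x, y in data) / len(data)

rng = random.Random(0)
# Labels are coin flips, so there is no true pattern linking x to y.
data = [(rng.random(), rng.randint(0, 1)) for _ in range(400)]
train, test = data[:200], data[200:]

print("train accuracy:", accuracy(train, train))  # the model memorises the noise
print("test accuracy:", accuracy(test, train))    # close to chance on unseen data
```

The gap between the two figures is the signature of overfitting, which is why evaluation has to use data that played no part in building the model.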
25
Knowledge Discovery Process
  • Start with Business (Problem) Understanding
  • Data Preparation usually takes the most effort
  • Knowledge Discovery is an Iterative Process
26
Key Ideas
  • Avoid Overfitting!
  • Data Preparation
  • catch false predictors
  • evaluation: train, validate, test subsets
  • Classification: C4.5, Bayes, k-nearest neighbour
  • Targeted Marketing: Lift, Gains, ROC
  • Clustering, Association, Other tasks
  • Knowledge Discovery is a Process
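The train, validate and test subsets mentioned above can be produced with a simple shuffled split. This is a minimal sketch; the 60/20/20 proportions, the function name and the seed are assumptions, not values given in the course.

```python
import random

def train_validate_test_split(rows, fractions=(0.6, 0.2, 0.2), seed=0):
    """Shuffle rows and cut them into train, validation and test subsets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = round(fractions[0] * len(shuffled))
    n_valid = round(fractions[1] * len(shuffled))
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_valid],
            shuffled[n_train + n_valid:])  # the remainder is the test subset

train, valid, test = train_validate_test_split(range(100))
print(len(train), len(valid), len(test))
```

The training subset is used to build models, the validation subset to choose between them, and the test subset only once at the end, so that the final estimate is not contaminated by model selection.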

27
Where next?
  • Data Mining and Knowledge Discovery site:
    www.KDnuggets.com
  • Data Mining and Knowledge Discovery Society: ACM
    SIGKDD, www.acm.org/sigkdd