Data Mining and Its Applications - PowerPoint PPT Presentation

Description:

Cars. Salary. Name. 9/7/09. http://www.cs.ust.hk/~qyang. 4. The problem: Example 2 ... I can go on and on about problems, but what do they share in common? ... – PowerPoint PPT presentation

Slides: 67
Provided by: Qiang
1
Data Mining and Its Applications
  • Qiang Yang
  • Hong Kong University of Science and Technology
  • qyang@cse.ust.hk
  • http://www.cse.ust.hk/~qyang
  • Thanks to Jiawei Han, Sunny Chee, Jian Pei, and
    many others

2
Outline
  • Part I: The problem
  • Part II: The (ideal) formulation
  • Part III: The (ideal) solution
  • Part IV: The reality
  • The practice today
  • Looking into the future

3
Part I: The Problem, Example 1
  • You are a marketing manager for a brokerage
    company
  • Problem: churn (also known as attrition) is too
    high
  • Turnover is 40% annually
  • You have some potential incentives, but
  • giving new incentives to everyone who might leave
    is very expensive (as well as wasteful)
  • Bringing back a customer after they leave is both
    difficult and costly!
  • What to do?
  • The data mining method
  • Collect lots of customer data
  • Make predictions

4
The Problem: Example 2
  • You are running a Web search engine
  • You make money by selling advertisements
  • Each click on an ad earns 1.00
  • How to match query terms to ads?
  • Try the Google query "bank"

5
The Problem: Example 3
  • You are running a brokerage firm
  • You wish to find out the correlations between
    economic factors and stock prices
  • How to predict tomorrow's stock market?
  • We know quite a bit about the economic factors
  • But we don't know how they combine to take
    effect
  • Can we ask computers to do that for us?

6
The Problem: Example 4
  • We wish to design a system to recognize whether a
    user of our credit card is a thief
  • How to recognize this from the purchases?
  • We wish to find out whether a person who logs in
    to our computer system is a hacker
  • How to recognize this from the sequence of commands?

7
The Problem: Example 5
  • You are running an online shop such as Amazon.com
  • Your user has bought some books already
  • Which other books do you recommend to your user?
  • Based on other people's purchases?
  • This requires that you know which users are
    alike
  • If two users are similar to each other, then their
    purchases are also likely to be similar

8
The Problem: Example 6
  • You have an elderly parent at home, alone
  • You wish to watch out for the parent using
    sensors
  • If something seems wrong, you wish to be
    notified, and to call the doctor right away
  • How do you tell if something is out of the norm
    and sound the alarm in time?

9
Part II: The (Ideal) Formulation
[Figure: scatter plot of two classes; one marker denotes +1, the other denotes -1]
  • I can go on and on about problems, but what do
    they share in common?
  • Given a set of objects with known descriptive
    variables and known classes, that is,
  • (X_i, y_i), i = 1, 2, ..., t
  • y_i = +1 or -1
  • Problem 1
  • Find a discriminative function f(x) such that
    f(X_i) = y_i
  • Supervised learning
  • Problem 2
  • Find clusters of similar data X_i
  • Unsupervised learning

10
DM Definition: A Predictive Model is akin to
  • A black box that makes predictions about the
    future based on information from the past and
    present
  • Large number of inputs usually available

11
Convergence of Three Technologies
  • Machine learning and data mining algorithms are
    improving at a dramatic pace
  • We can handle millions of records in seconds on a
    typical PC

12
How are Models Built and Used?
  • View from 20,000 feet

13
Predictive Models are
  • Decision Trees
  • Nearest Neighbor
  • Neural Networks
  • Rule Induction
  • Clustering
  • Marketing
  • Direct mail marketing
  • Web site personalization
  • Fraud Detection
  • Credit card fraud detection
  • Science
  • Bioinformatics
  • Gene analysis
  • Web Text analysis
  • Google

14
Part III Ideal Solutions
  • Decision Trees
  • Naïve Bayesian
  • Support Vector Machines
  • K-means Clustering
  • Correlation Analysis
  • First, decision trees
  • Have you played the 20 questions game?

15
Solution 1: Decision Trees for Risk Analysis
(thanks to D.B. Chan, S.H. Univ.)
  • Cases

16
Classification (thanks to D.B. Chan, S.H. Univ.)
  • Training Phase

[Figure: step 1 of building the decision tree. The root splits on Income: Income = High leads to partition D1, Income = Low leads to partition D2.]
17
Classification Analysis (thanks to D.B. Chan, S.H. Univ.)
  • Training Phase

[Figure: step 2 of building the decision tree. The root splits on Income into D1 (High) and D2 (Low). D1 is split further on Married into D1a and D1b. The leaves carry the Risk labels Good and Poor.]
18
Classification: Prediction
  • Prediction phase: a new data item arrives

Name = Susan, Debt = Low, Income = High, Married =
Yes. What is the Risk factor?
[Figure: the decision tree from training. Income = High leads to D1, which splits on Married (Yes: Good, No: Poor); Income = Low leads to D2.]
So the Risk factor for this customer is Good!
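The prediction walk for Susan can be sketched in a few lines of Python. This is only an illustration of following a trained tree from root to leaf: the function name `risk` is hypothetical, and the label for the Income = Low branch is an assumption, since the slide only states the outcome for Susan.

```python
def risk(income: str, married: str) -> str:
    """Walk the two-level tree: split on Income first, then on Married."""
    if income == "High":
        # Partition D1: high-income customers are split again on marital status.
        return "Good" if married == "Yes" else "Poor"
    # Partition D2: assumed here to be a single leaf labeled Poor.
    return "Poor"


susan = {"Name": "Susan", "Debt": "Low", "Income": "High", "Married": "Yes"}
print(risk(susan["Income"], susan["Married"]))  # Good
```

Note that Debt is never consulted: the tree only tests the attributes that were chosen as splits during training.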
19
Classification Tree: can be used for classifying
the type of customer
[Figure: classification tree for customer type (node labels not legible). Splits use thresholds such as 45, 23, 25,000 and 60,000; each node shows the percentage split between the two classes (12% / 88% at the root); leaves are labeled good or bad.]
20
Solution 2: Bayesian Methods
21
Naïve Bayes Method
  • Assume that the attributes are independent given
    the class Play

[Figure: the class node play points to each attribute node: outlook, temp, humidity, windy]
Pr(outlook=sunny | windy=true, play=yes) =
Pr(outlook=sunny | play=yes)
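A minimal sketch of how the independence assumption turns into a classifier: the score of each class is its prior times the product of per-attribute conditional probabilities. The six-row table below is hypothetical toy data, not the weather data from the slides, and no smoothing is applied, so an unseen attribute value zeroes out a class.

```python
from collections import Counter, defaultdict


def train(rows, target):
    """Count class frequencies and per-class attribute-value frequencies."""
    class_counts = Counter(r[target] for r in rows)
    value_counts = defaultdict(Counter)  # (attribute, class) -> Counter of values
    for r in rows:
        for attr, value in r.items():
            if attr != target:
                value_counts[(attr, r[target])][value] += 1
    return class_counts, value_counts


def posterior(x, class_counts, value_counts):
    """Return Pr(class | x) under the naive independence assumption."""
    total = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / total  # prior Pr(class)
        for attr, value in x.items():
            score *= value_counts[(attr, c)][value] / n_c  # Pr(attr=value | class)
        scores[c] = score
    norm = sum(scores.values())
    return {c: s / norm for c, s in scores.items()}


# Hypothetical toy rows, for illustration only.
rows = [
    {"outlook": "sunny", "windy": "false", "play": "no"},
    {"outlook": "sunny", "windy": "true", "play": "no"},
    {"outlook": "overcast", "windy": "false", "play": "yes"},
    {"outlook": "rainy", "windy": "true", "play": "yes"},
    {"outlook": "rainy", "windy": "true", "play": "no"},
    {"outlook": "sunny", "windy": "false", "play": "yes"},
]
class_counts, value_counts = train(rows, "play")
new_day = {"outlook": "sunny", "windy": "true"}
print(posterior(new_day, class_counts, value_counts))
```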
22
Probabilities for the weather data
  • A new day

23
Why is Naïve Bayes so accurate? (Domingos,
1997)
24
Solution 3: Support Vector Machines
  • Hard-Margin Linear Classifier
  • Maximize Margin
  • Support Vectors
  • Quadratic Programming
  • Soft-Margin Linear Classifier
  • Non-Linear Separable Problem and Kernels
  • XOR
  • Extension to
  • Regression for numerical class
  • Ranking rather than classification
  • SMO and Core vector machines
  • Problem
  • Given a set of objects with known descriptive
    variables and known classes, that is,
  • (X_i, y_i), i = 1, 2, ..., t
  • Find
  • a discriminative function f(x) such that
    f(X_i) = y_i.
  • SVM today
  • A must-try for most applications
  • Mathematically well founded
  • Robust to noise (non-support vectors are ignored)
  • Works even with only dozens of training examples
  • Among the most accurate algorithms
  • Has many extensions
  • Can be scaled up (ongoing work)

25
Linear Classifiers
[Figure: input x feeds classifier f, which outputs the estimate y_est; one marker denotes +1, the other denotes -1]
f(x, w, b) = sign(w . x - b)
How would you classify this data?
26
Linear Classifiers
f(x, w, b) = sign(w . x - b)
Any of these would be fine... but which is best?
27
Maximum Margin
f(x, w, b) = sign(w . x - b)
The maximum margin linear classifier is the
linear classifier with the maximum margin.
Support vectors are those data points that the
margin pushes up against.
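To make "which is best?" concrete, here is a small sketch that evaluates f(x, w, b) = sign(w . x - b) and compares two separating lines by their margin, taken as the smallest distance from any training point to the line. The points and the two candidates are made up for illustration; a real SVM searches over all (w, b) rather than a fixed list.

```python
import math


def classify(x, w, b):
    """f(x, w, b) = sign(w . x - b), returned as +1 or -1."""
    score = sum(wi * xi for wi, xi in zip(w, x)) - b
    return 1 if score >= 0 else -1


def margin(points, w, b):
    """Smallest distance from any point to the hyperplane w . x = b."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(abs(sum(wi * xi for wi, xi in zip(w, x)) - b) / norm for x in points)


# Hypothetical, linearly separable toy data.
points = [(1.0, 1.0), (2.0, 2.0), (4.0, 4.0), (5.0, 5.0)]
labels = [-1, -1, 1, 1]

# Two candidate classifiers; both separate the data perfectly.
candidates = {"A": ((1.0, 1.0), 6.0), "B": ((1.0, 1.0), 5.0)}
best = max(candidates, key=lambda name: margin(points, *candidates[name]))
print(best)
```

Candidate A sits halfway between the closest points of the two classes, so it has the larger margin; those closest points, (2, 2) and (4, 4), play the role of the support vectors.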
28
SVM Questions
  • Can we understand the meaning of the SVM through
    a solid theoretical foundation?
  • Can we extend the SVM formulation to handle cases
    where we allow errors to exist, when even the
    best hyperplane must admit some errors on the
    training data?
  • Can we extend the SVM formulation so that it
    works in situations where the training data are
    not linearly separable?
  • Can we extend the SVM formulation so that the
    task is to rank the instances by their likelihood
    of being a positive class member, rather than
    to classify them?
  • Can we scale up the algorithm for finding the
    maximum margin hyperplanes to thousands and
    millions of instances?

29
Q1. Support Vector Machines: Foundations
  • The problem of finding the maximum margin can be
    transformed into finding a stationary point of a
    Lagrangian
  • Can be solved using quadratic programming (QP)
  • Has solid theoretical foundations
  • Future error < Training error + C h^(1/2)
  • h = VC dimension, which is the maximum number of
    examples shattered by a function class f(x, a)

30
Q2: Extension to allow errors
When noise exists: Minimize w . w + C (distance
of error points to their
correct place)
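One common way to make "distance of error points to their correct place" precise is the hinge slack max(0, 1 - y (w . x - b)), summed over the training points. The sketch below evaluates the objective in that form; the choice of hinge slack is an assumption about what the slide intends, and the function only scores a given (w, b), it does not optimize.

```python
def soft_margin_objective(w, b, C, points, labels):
    """Return w . w + C * (total slack of points on the wrong side of their margin)."""
    ww = sum(wi * wi for wi in w)
    slack = 0.0
    for x, y in zip(points, labels):
        score = sum(wi * xi for wi, xi in zip(w, x)) - b
        # Zero for points beyond their margin, positive for violators.
        slack += max(0.0, 1.0 - y * score)
    return ww + C * slack


# Hypothetical 1-D data: the third point lies on the wrong side.
points = [(2.0,), (-2.0,), (0.5,)]
labels = [1, -1, -1]
print(soft_margin_objective((1.0,), 0.0, 2.0, points, labels))  # 4.0
```

A larger C penalizes violations more heavily and pushes the solution toward the hard-margin classifier; a smaller C tolerates more errors in exchange for a wider margin.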
31
Q3: Non-linear transformation to feature spaces
  • General idea: introduce kernels

Φ: x → φ(x)
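The XOR problem shows why the mapping helps. In the sketch below the feature map φ(x) = (x1, x2, x1 * x2) is written out explicitly, and a plain linear classifier in the new space separates data that no line separates in the original space. The map and the weights are chosen by hand for illustration; a kernel SVM would compute inner products in the feature space without ever forming φ(x).

```python
def phi(x):
    """Explicit feature map: append the product of the two inputs."""
    x1, x2 = x
    return (x1, x2, x1 * x2)


def classify(z, w, b):
    """Linear classifier sign(w . z - b) in the feature space."""
    score = sum(wi * zi for wi, zi in zip(w, z)) - b
    return 1 if score >= 0 else -1


# XOR with inputs coded as -1 / +1: the label is +1 when the inputs differ.
xor_data = [((-1, -1), -1), ((-1, 1), 1), ((1, -1), 1), ((1, 1), -1)]

# Hand-picked weights: only the product feature matters.
w, b = (0.0, 0.0, -1.0), 0.0
predictions = [classify(phi(x), w, b) for x, _ in xor_data]
print(predictions)  # [-1, 1, 1, -1]
```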
32
Q4: Extension to other tasks?
f(x, w, b) = w . x - b
  • SVM for ranking
  • Ranking SVM
  • Idea
  • For each ordered pair of instances (x1, x2) where
    x1 < x2 in the ranking
  • Generate a new instance
  • <x1, x2, +1>
  • Train an SVM on the new data set
  • SVM for regression analysis
  • SV regression
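The pair-generation step for ranking can be sketched as follows. Two details are assumptions beyond the slide: each pair is represented by the difference of the two feature vectors, as is common for Ranking SVM, and a mirrored instance with label -1 is added so that the binary SVM sees both classes. The helper name `pairwise_instances` is hypothetical.

```python
from itertools import combinations


def pairwise_instances(items):
    """items: list of (feature_vector, rank); a larger rank means ranked higher."""
    out = []
    for (xa, ra), (xb, rb) in combinations(items, 2):
        if ra == rb:
            continue  # ties carry no ordering information
        lo, hi = (xa, xb) if ra < rb else (xb, xa)
        diff = tuple(h - l for h, l in zip(hi, lo))
        out.append((diff, 1))                        # <x1, x2, +1>
        out.append((tuple(-d for d in diff), -1))    # mirrored pair (assumption)
    return out


# Hypothetical instances with ranks 1 < 2 < 3.
items = [((1.0, 0.0), 1), ((2.0, 1.0), 2), ((4.0, 3.0), 3)]
pairs = pairwise_instances(items)
print(len(pairs))  # 6
```

Any binary linear SVM trained on `pairs` yields a weight vector w, and instances are then ranked by the score w . x.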

33
Q5: Scale up?
  • One of the initial drawbacks of SVM is its
    computational inefficiency.
  • However, this problem is being solved with great
    success.
  • SMO
  • Breaks a large optimization problem into a series
    of smaller problems, each involving only a couple
    of carefully chosen variables
  • The process iterates
  • Core Vector Machines
  • Find an approximate minimum enclosing ball of
    a set of instances.
  • These instances, when mapped to an N-dimensional
    space, represent a core set
  • Allows very fast solutions.
  • Train a high-quality SVM on millions of examples
    in seconds

34
Solution 4: The K-Means Clustering Method
  • Given data on N customers
  • Can you find out who are similar?
  • Can you put all similar users in the same group
    automatically?
  • K-means clustering
  • Assumes that you wish to have K such groups
  • Assumes that the data are given without the class
    labels
  • Assume that each dimension of the data has a
    concept of a mean value

35
The mean point
The mean point can be a virtual point
36
The K-Means Clustering Method
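  • Example (K = 2)
  • Arbitrarily choose K objects as the initial
    cluster centers
  • Assign each object to the most similar center
  • Update the cluster means
  • Reassign the objects and update the cluster means
    again, repeating until the assignments stop changing

[Figure: scatter plots on 0 to 10 axes showing the assign, update, and reassign steps]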
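The assign / update / reassign loop can be sketched in plain Python as below. This is a minimal version under stated assumptions: squared Euclidean distance as the similarity measure, a fixed random seed for the arbitrary initial choice, and made-up 2-D points.

```python
import random


def kmeans(points, k, iterations=100, seed=0):
    """Return (centers, assignment) after running the K-means loop."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # arbitrarily choose K objects as initial centers
    assignment = None
    for _ in range(iterations):
        # Assign each object to the most similar (closest) center.
        new_assignment = [
            min(range(k), key=lambda j: sum((p - c) ** 2 for p, c in zip(pt, centers[j])))
            for pt in points
        ]
        if new_assignment == assignment:
            break  # nothing was reassigned, so the means will not move
        assignment = new_assignment
        # Update the cluster means; a mean can be a virtual point.
        for j in range(k):
            members = [pt for pt, a in zip(points, assignment) if a == j]
            if members:
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, assignment


# Hypothetical data with two obvious groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, assignment = kmeans(points, k=2)
print(assignment)
```

The result depends on the initial centers in general, which is why the method can stop at a local optimum; on well-separated data like this toy set the two groups are recovered.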
37
Comments on the K-Means Method
  • Strength: relatively efficient
  • Comment: often terminates at a local optimum
  • Weaknesses
  • Applicable only when a mean is defined; what
    about categorical data?
  • Need to specify k, the number of clusters, in
    advance
  • Handles noisy data and outliers poorly
  • Not suitable for discovering clusters with
    non-convex shapes

What about these?
38
Solution 5: Data Mining for Recommendation
  • Which movie would Sammy watch next?
  • Ratings: 1 to 5

39
Statistical Collaborative Filters
  • Users annotate items with numeric ratings.
  • Users who rate items similarly become mutual
    advisors.
  • Recommendation computed by taking a weighted
    aggregate of advisor ratings.

40
Basic Idea
  • Nearest Neighbor Algorithm
  • Given a user a and an item i
  • First, find the users most similar to a
  • Let these be Y
  • Second, find how these users (Y) rated i
  • Then, calculate a predicted rating of a on i
    based on some average over all these users Y
  • How to calculate the similarity and the average?

41
Statistical Filters
  • GroupLens (Resnick et al., 1994, MIT)
  • Filters UseNet News postings
  • Similarity: Pearson correlation
  • Prediction: weighted deviation from the mean

42
Pearson Correlation
43
Pearson Correlation
  • Weight between users a and u
  • Compute the similarity matrix between users
  • Use Pearson correlation, which ranges from -1
    through 0 to 1
  • Let items be all items that both users rated
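A sketch of the two pieces named on these slides, Pearson correlation as the user-to-user weight and a prediction built from weighted deviations from each advisor's mean rating. The ratings table is hypothetical, and details such as averaging each user's mean over all of their ratings are assumptions rather than the exact GroupLens procedure.

```python
import math


def pearson(ra, ru):
    """Pearson correlation over the items that both users rated."""
    common = [i for i in ra if i in ru]
    if len(common) < 2:
        return 0.0
    ma = sum(ra[i] for i in common) / len(common)
    mu = sum(ru[i] for i in common) / len(common)
    num = sum((ra[i] - ma) * (ru[i] - mu) for i in common)
    den = math.sqrt(sum((ra[i] - ma) ** 2 for i in common)
                    * sum((ru[i] - mu) ** 2 for i in common))
    return num / den if den else 0.0


def predict(ratings, a, item):
    """Mean rating of user a plus the weighted deviation of the advisors."""
    ra = ratings[a]
    mean_a = sum(ra.values()) / len(ra)
    num = den = 0.0
    for u, ru in ratings.items():
        if u == a or item not in ru:
            continue
        w = pearson(ra, ru)
        mean_u = sum(ru.values()) / len(ru)
        num += w * (ru[item] - mean_u)
        den += abs(w)
    return mean_a + num / den if den else mean_a


# Hypothetical ratings on a 1 to 5 scale.
ratings = {
    "Sammy": {"m1": 5, "m2": 3, "m3": 1},
    "Ann": {"m1": 4, "m2": 3, "m3": 2, "m4": 5},
    "Bob": {"m1": 1, "m2": 3, "m3": 5, "m4": 1},
}
print(predict(ratings, "Sammy", "m4"))  # 4.5
```

Bob's tastes are the opposite of Sammy's (weight -1), so Bob's low rating of m4 counts as evidence that Sammy will like it, just as Ann's high rating does.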

44
Part IV The reality
45
DM Today in Financial Industry
[Figure: diagram of customer events, labeled Appl, Credit, Query, Usage, Complain, Leave, Promotion, Entry, Leave, Management, Compete, Home Buying, Work, Married, Birth of Child, Retire, Promoted]
46
The Process of Data Mining
[Figure: Relational Database → Select → Preprocess → Transform → Mine → Evaluate → Knowledge]
47
The practice today: an example (Ling and Li,
KDD'98)
  • The mailing cost is reduced
  • Yet the response rate is improved
  • The net profit is increased dramatically

48
Specific problems and solutions
  • Extremely imbalanced class distribution
  • E.g. only 1% are positive (buyers), and the rest
    are negative (non-buyers).
  • Evaluation criterion for the data mining process
  • Predictive accuracy is no longer suitable.
  • A training set with a large number of variables
    can be too large.
  • An efficient learning algorithm is required.
  • Rank training and testing examples
  • We require learning algorithms to produce
    probability estimates or confidence factors.
  • Use lift as the evaluation criterion
  • A lift reflects the redistribution of responders
    in the testing set after ranking the testing
    examples.

49
Solution: lift index for evaluation
  • A typical lift table
  • Use a weighted sum of the items in the lift table
    divided by their total sum: the lift index
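A minimal sketch of the lift index, assuming the usual setup of ten equal bins with weights 1.0, 0.9, ..., 0.1 from the top-ranked bin down. The scores and labels are made up; the function assumes at least one responder, otherwise the division fails.

```python
def lift_index(scores, labels, bins=10):
    """Rank examples by score, count responders per bin, return (index, table)."""
    ranked = [y for _, y in sorted(zip(scores, labels), key=lambda t: -t[0])]
    n = len(ranked)
    # Lift table: number of responders in each bin, best-scored bin first.
    table = [sum(ranked[i * n // bins:(i + 1) * n // bins]) for i in range(bins)]
    weights = [(bins - i) / bins for i in range(bins)]  # 1.0, 0.9, ..., 0.1
    total = sum(table)
    return sum(w * s for w, s in zip(weights, table)) / total, table


# Hypothetical test set: 10 examples, the two responders are ranked first.
scores = [i / 10 for i in range(10, 0, -1)]
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
index, table = lift_index(scores, labels)
print(round(index, 2))  # 0.95
```

With these weights, a ranking that spreads responders evenly across the bins scores 0.55, and a ranking that concentrates them at the top approaches 1.0.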

50
MTMI 521 @ HKUST Data Mining Knowledge
Management
How to successfully find a date
  • Start
  • Introduction
  • Objectives
  • Methodology Process
  • Primary Results
  • Elementary Analysis
  • Associate
  • Cluster
  • Classify
  • Key Learning Points
  • Finish Next Steps

Team Members
  • James Chung
  • Nelson To
  • Joyce Ngo
  • Michael Lee
  • Alesandro Sicheri

The Dating and Mating Team 27 November, 2004
51
Objectives
  • Improve the User Experience
  • Our objective is to use traditional data
    mining techniques (associate, cluster, and
    classify) to answer the following types of very
    important dating questions:
  • Who is my ideal dating partner? Will he/she
    still be with me in the morning?
  • Am I attractive to the opposite sex?
  • Which type of person has a tendency
    to be unfaithful, and which does not?
  • Which group tends to have long-term
    relationships?

52
Methodology Process (Survey Hardware Topology)

10 minutes from download to Weka
53
Methodology Process

54
Primary Results
Overview
  • Data set (28 Nov 2004) used for the report: 85
    records and 122 attributes.
  • 48 men and 37 women replied

Issues
  • Disclaimer: we may not have enough data to
    discover answers for all our objectives; however,
    we aim to discover the way to find answers. So,
    if you have not completed the survey, then please
    do.

55
Elementary Analysis
Results
  • Analysis of statistics produced by Access and
    Weka led to the following elementary findings
  • 50% have a tendency to accept one-night stands
  • 33% accept dating two people at the same time
  • Most people like to seek an intelligent partner,
    but 30% of people prefer a not-so-intelligent
    partner
  • Most people want a long-term relationship (70%)
  • Most people are not interested in Intellectual
    Pursuit, Politics, Collecting, Gardening, or
    Dancing
  • The most favored partner occupation is
    professional
  • The vast majority of people want their partner to
    be slightly eccentric (82%)
  • The most desirable partner attributes are
    sincere, patient, humorous, creative (not dull),
    organized
  • People's favorite book category is Literature and
    Fiction
  • People's favorite film type is Romantic Comedies
  • The most popular ideal partner sign is Scorpio
  • Most people consider themselves creative and
    organized (slight contradiction)

Observation: Does this analysis reflect Hong Kong
culture? Population type has an influence on the
overall characteristics of the dataset. Would we
get the same results asking a group of
Australians? Data miners should be aware of
factors that may change the characteristics of
a model if it is applied to heterogeneous datasets.
Issues
56
Associate Analysis
Analysis
  • Initial aim: try to determine the association
    between personal characteristics and partner
    characteristics. (Algorithm: Apriori)
  • Patterns of limited interest were generated because
    most people acted humbly in defining their
    characteristics.
  • 65. Thoughtful_to_Indifferent=2
    Organized_to_Spontaneous=3 18 ==>
    Partner_Selfish_to_Respective=4 17 conf:(0.94)
  • 398. Selfish_to_Respective=4 Creative_to_Dull=2
    21 ==> Partner_Computer_and_Technology=no 17
    conf:(0.81)
  • This pattern tells us that most people who are
    respectful and creative do not want a partner
    whose hobby is computers and technology
57
Associate Analysis
  • Evolution: personal age, sex, height, body type,
    eye color, hair length, education, occupation,
    smoking, drinking, religion, married status, want
    children, first language and second language
    attributes
  • 35. Smoking=non Partner_Creative_to_Dull=2 40 ==>
    Second_Language=english 37 conf:(0.93)
  • 4428. Smoking=non Religion=none
    First_Language=cantonese Second_Language=english
    Partner_Pessimistic_to_Enthusiastic=4
    Partner_Conservative_to_Eccentric=3 13 ==>
    Married_Status=single 13 conf:(1)
  • Associations may not align with domain knowledge
    (we are still working in this area)
  • The patterns discovered depend on the
    attributes you select (domain knowledge)

Issues
58
Cluster Analysis
Analysis
  • See if there is any interesting pattern between
    horoscope and double dating. The clustered result
    showed no specific patterns after 3 iterations,
    and cluster size ranged from 2 to 10.

The only star sign to be clustered was Virgo
  • Observation: is Virgo the most faithful?

59
Cluster Analysis
Analysis
  • Using a classification algorithm, higher relevance
    was found between Sensitive_to_Indifferent and
    Double_Date. Therefore, the next step is to use
    clustering to find whether there is any interesting
    pattern.

SimpleKMeans -N 3 -S 10
kMeans
Number of iterations: 3
Within cluster sum of squared errors: 32.71890394088671
Cluster centroids:
Cluster 0  Mean/Mode: male    2.3929  no    Std Devs: N/A  0.8751  N/A
Cluster 1  Mean/Mode: female  2.0345  no    Std Devs: N/A  0.7784  N/A
Cluster 2  Mean/Mode: male    2.8214  yes   Std Devs: N/A  0.7724  N/A
Clustered Instances
0  28 (33%)
1  29 (34%)
2  28 (33%)
  • Observation: the more indifferent men double date

60
Cluster Analysis
Issues
  • Clustering for clarity?
  • There might be no meaningful cluster at all!
  • Is there any correct way to cluster?
  • There is no best way for clustering.
  • In this example, applying a different data
    mining technique answered the original question:
    keep an open mind and try different things.
  • Pain areas
  • How to determine the number of clusters?
  • Attribute significance cannot be determined.
  • Lacks explanation capabilities - use domain
    knowledge and imagination!

61
Classify (Case Study)
Analysis

62
Classify
Our aim is to predict an individual's ideal partner
(an example)
Oops: as you can see, this is work in progress
63
Key Learning Points
Issues
  • Domain knowledge is king! You cannot do DM
    without it.
  • What came first, the chicken or the egg? Do you
    set your objectives first, or do you use data
    mining to identify them?
  • Garbage in, garbage out: data capture and
    preparation is a critical part of the data
    mining process; to ignore it is to invite
    disaster. Discovering the appropriate attributes
    will make or break a data mining exercise.
  • Quality versus quantity: manage the trade-off
    between creating a data-rich source with less
    participation and a data-poor source with
    larger participation.
  • When do we need to retrain the model? What
    factors influence this decision?
  • Trust, but audit! Walk through and check
    consistency at every stage of data mining (ideally
    involving all members of the design team)

64
Key Learning Points
Issues
  • What is an interesting pattern? How you identify
    them can be subjective and based on domain
    knowledge. We are still learning about what to do
    with our data.
  • Never trust the data entered: always check it
    and align it with the real world.
  • How do you manage a data mining project? That is
    the million-dollar question. Although we are
    still working on it, we are improving with every
    moment.
  • Need to walk before running: build up
    individual and team experience before tackling
    difficult data mining tasks. Our team was
    completely confused after the first week; it can
    be argued we still are!
  • Finally, data mining is not easy! It is an art,
    not a science, and therefore alien to engineers.
    But it is fun and exciting when you discover
    something new!

65
Current Research at HKUST: http://www.cs.ust.hk/~qyang
  • Wireless Data Mining
  • How to detect user behaviors in a wireless
    environment?
  • We are the inventors of wireless user-activity
    recognition software
  • Web Data Mining
  • How to classify text documents?
  • We are the winners of the 2005 ACM KDD Cup world Web
    search competition

66
Conclusions
  • Data Mining is a process, not a tool
  • Data mining can be used in financial industry to
  • Analyze customer profiles
  • Rank customers on credit and risk
  • Structure marketing campaigns
  • Plans for action