Title: Data Mining and Its Applications
1Data Mining and Its Applications
- Qiang Yang
- Hong Kong University of Science and Technology
- qyang@cse.ust.hk
- http://www.cse.ust.hk/qyang
- Thanks: Jiawei Han, Sunny Chee, Jian Pei, and many others
2Outline
Part
- The problem
- The (ideal) formulation
- The (ideal) solution
- The reality
- The practice today
- Looking into the future
3 Part I The problem Example 1
- You are a marketing manager for a brokerage company
- Problem: churn (also known as attrition) is too high
- Turnover is 40% annually
- You have some potential incentives, but
- giving new incentives to everyone who might leave is very expensive (as well as wasteful)
- Bringing back a customer after they leave is both difficult and costly!
- What to do?
- The data mining method
- Collect lots of customer data
- Make predictions
4The problem Example 2
- You are running a Web search engine
- You make money by selling advertisements
- Each click on an ad is worth 1.00
- How to match query terms to ads?
5The Problem Example 3
- You are running a brokerage firm
- You wish to find out the correlations between economic factors and stock price
- How to predict tomorrow's stock market?
- We know quite a bit about the economic factors
- But we don't know how well they combine to take effect
- Can we ask computers to do that for us?
6The problem Example 4
- We wish to design a system to recognize if a user of our credit card is a thief
- How to recognize this from the purchases?
- We wish to find out whether a person who logs in to our computer system is a hacker
- How to recognize this from the sequence of commands?
7The problem Example 5
- You are running an online shop such as Amazon.com
- Your user has bought some books already
- Which other books do you recommend to your user?
- Based on other people's purchases?
- This requires that you know which two users are alike
- If they are similar to each other, then their purchases are also similar to each other
8The Problem Example 6
- You have an old parent at home, alone
- You wish to watch out for the parent using sensors
- If something seems wrong, you wish to be notified, and call the doctor right away
- How do you tell if something is out of the norm and sound the alarm in time?
9Part II The (ideal) Formulation
[Figure: scatter plot of two classes of points; one marker denotes +1, the other denotes -1]
- I can go on and on about problems, but what do they share in common?
- Given a set of objects with known descriptive variables, and from known classes:
- $(X_i, y_i)$, $i = 1, 2, \ldots, t$
- $y_i = +1$ or $-1$
- Problem 1
- Find a discriminative function $f(x)$ such that $f(X_i) = y_i$
- Supervised learning
- Problem 2
- Find clusters of similar data $X_i$
- Unsupervised learning
10DM Definition Predictive Model is akin to
- A black box that makes predictions about the future based on information from the past and present
- A large number of inputs is usually available
11Convergence of Three Technologies
- Machine learning and data mining algorithms are improving at a dramatic pace
- We can handle millions of data records in seconds on a typical PC
12How are Models Built and Used?
13Predictive Models are
- Decision Trees
- Nearest Neighbor
- Neural Networks
- Rule Induction
- Clustering
- Marketing
- Direct mail marketing
- Web site personalization
- Fraud Detection
- Credit card fraud detection
- Science
- Bioinformatics
- Gene analysis
- Web Text analysis
- Google
14Part III Ideal Solutions
- Decision Trees
- Naïve Bayesian
- Support Vector Machines
- K-means Clustering
- Correlation Analysis
- First, decision trees
- Have you played the 20 question game?
15Solution 1 Decision trees for Risk Analysis (thanks D.B. Chan, S.H. Univ.)
16Classification (thanks D.B. Chan, S.H. Univ.)
[Figure: decision tree, step 1. The root splits the data on Income (High / Low) into subsets D1 and D2.]
17Classification Analysis (thanks D.B. Chan, S.H. Univ.)
[Figure: decision tree, steps 1 and 2. The root splits on Income (High / Low) into D1 and D2; D1 is split further on Married (Yes / No) into D1a and D1b; the leaves carry the risk labels Good and Poor.]
18Classification Prediction
- Prediction phase: a new data item arrives
Name = Susan, Debt = Low, Income = High, Married = Yes; what is the Risk factor?
[Figure: the same decision tree; the new item follows Income = High, then Married = Yes, and reaches the leaf Good.]
So the Risk factor for this customer is Good!
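The prediction walk on this slide can be written as nested conditionals. This is only a minimal sketch: the path Income = High, Married = Yes leading to Good comes from the slide, while the labels on the other two leaves (Poor) and the mapping of D1a/D1b to the Married branches are assumptions read from the figure residue. `predict_risk` is a hypothetical name.

```python
def predict_risk(customer):
    """Walk the two-level tree: split on Income first, then on Married."""
    if customer["Income"] == "High":        # subset D1
        if customer["Married"] == "Yes":    # subset D1a (assumed mapping)
            return "Good"
        return "Poor"                       # subset D1b (assumed leaf label)
    return "Poor"                           # subset D2 (assumed leaf label)


susan = {"Name": "Susan", "Debt": "Low", "Income": "High", "Married": "Yes"}
print(predict_risk(susan))  # Good
```

Note that Debt is never tested on this path: a tree only looks at the attributes along the branch an item follows.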
19Classification Tree: can be used for classifying type of customer
[Figure: classification tree for customer type. Splits use the thresholds 45, 23, 25,000 and 60,000; each node shows a pair of percentages (12 / 88 at the root); leaves are labeled good or bad.]
20Solution 2 Bayesian Methods
21Naïve Bayes Method
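- Assume that the attributes are independent given the class Play
[Figure: class node play and attribute nodes outlook, temp, humidity, windy]
$\Pr(\text{outlook} = \text{sunny} \mid \text{windy} = \text{true}, \text{play} = \text{yes}) = \Pr(\text{outlook} = \text{sunny} \mid \text{play} = \text{yes})$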
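Under the independence assumption, the score of a class is its prior multiplied by one conditional probability per attribute. The sketch below shows this with plain frequency counts. The 14 rows are the standard nominal weather data set that ships with Weka, which is assumed (not confirmed by the slides) to be the "weather data" meant here; no smoothing is applied, and `train` and `score` are hypothetical names.

```python
from collections import Counter, defaultdict

# (outlook, temp, humidity, windy, play): assumed to be the Weka weather data
ROWS = [
    ("sunny", "hot", "high", "false", "no"),
    ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"),
    ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"),
    ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"),
    ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"),
    ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"),
    ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"),
    ("rainy", "mild", "high", "true", "no"),
]


def train(rows):
    """Count class frequencies and per-class attribute-value frequencies."""
    class_counts = Counter(row[-1] for row in rows)
    cond = defaultdict(Counter)  # (attribute index, class) -> value counts
    for row in rows:
        for j, value in enumerate(row[:-1]):
            cond[(j, row[-1])][value] += 1
    return class_counts, cond


def score(x, class_counts, cond):
    """Return Pr(class) * product_j Pr(x_j | class) for every class."""
    n = sum(class_counts.values())
    scores = {}
    for c, nc in class_counts.items():
        p = nc / n
        for j, value in enumerate(x):
            p *= cond[(j, c)][value] / nc
        scores[c] = p
    return scores


class_counts, cond = train(ROWS)
scores = score(("sunny", "cool", "high", "true"), class_counts, cond)
prediction = max(scores, key=scores.get)
print(scores, prediction)
```

The scores are unnormalized; dividing each by their sum would give the posterior probabilities.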
22Probabilities weather data
23Naïve Bayesian is very accurate, why? (Domingos,
1997)
24Solution3 Support Vector Machines
- Hard-Margin Linear Classifier
- Maximize Margin
- Support Vectors
- Quadratic Programming
- Soft-Margin Linear Classifier
- Non-Linear Separable Problem and Kernels
- XOR
- Extension to
- Regression for numerical class
- Ranking rather than classification
- SMO and Core vector machines
- Problem
- Given a set of objects with known descriptive variables, and from known classes:
- $(X_i, y_i)$, $i = 1, 2, \ldots, t$
- Find
- a discriminative function $f(x)$ such that $f(X_i) = y_i$
- SVM today
- A must-try for most applications
- Mathematically well founded
- Robust to noise (non-support vectors ignored)
- Works even with only dozens of training examples
- Among the most accurate algorithms
- Has many extensions
- Can be scaled up (ongoing work)
25 Linear Classifiers
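[Figure: x → f → y_est, with parameters α; scatter plot of two classes of points, one marker denotes +1 and the other denotes -1]
$f(x, w, b) = \mathrm{sign}(w \cdot x - b)$
How would you classify this data?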
26 Linear Classifiers
[Figure: x → f → y_est, with parameters α; the same scatter plot with several candidate separating lines]
$f(x, w, b) = \mathrm{sign}(w \cdot x - b)$
Any of these would be fine... but which is best?
27Maximum Margin
[Figure: x → f → y_est, with parameters α; the same scatter plot with the maximum-margin separating line]
$f(x, w, b) = \mathrm{sign}(w \cdot x - b)$
The maximum margin linear classifier is the linear classifier with the maximum margin.
Support vectors are those data points that the margin pushes up against.
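To make "which is best?" concrete, the sketch below evaluates the linear classifier $f(x, w, b) = \mathrm{sign}(w \cdot x - b)$ and the geometric margin of a separating line, that is, the smallest distance from any training point to the line. The four points and the two candidate lines are hypothetical toy values, and this only compares given candidates; it does not search for the maximum-margin line.

```python
from math import sqrt


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def f(x, w, b):
    """Linear classifier: sign(w . x - b), with ties sent to +1."""
    return 1 if dot(w, x) - b >= 0 else -1


def margin(points, labels, w, b):
    """Smallest signed distance y * (w . x - b) / ||w|| over the data."""
    norm = sqrt(dot(w, w))
    return min(y * (dot(w, x) - b) / norm for x, y in zip(points, labels))


# Hypothetical toy data: two points per class
points = [(1, 1), (2, 2), (4, 4), (5, 5)]
labels = [-1, -1, 1, 1]

line_a = ((1.0, 1.0), 6.0)   # x1 + x2 = 6
line_b = ((1.0, 0.0), 3.5)   # x1 = 3.5

for w, b in (line_a, line_b):
    correct = all(f(x, w, b) == y for x, y in zip(points, labels))
    print(w, b, correct, margin(points, labels, w, b))
```

Both lines classify every point correctly, but the first leaves a wider gap to the closest points; the points that attain that minimum are the support vectors.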
28SVM Questions
- Can we understand the meaning of the SVM through a solid theoretical foundation?
- Can we extend the SVM formulation to handle cases where we allow errors to exist, when even the best hyperplane must admit some errors on the training data?
- Can we extend the SVM formulation so that it works in situations where the training data are not linearly separable?
- Can we extend the SVM formulation so that the task is to rank the instances by the likelihood of being a positive class member, rather than classification?
- Can we scale up the algorithm for finding the maximum margin hyperplanes to thousands and millions of instances?
29Q1. Support Vector Machines Foundations
- The problem of finding the maximum margin can be transformed into finding the roots of a Lagrangian
- Can be solved using quadratic programming (QP)
- Has solid theoretical foundations
- $\text{Future error} < \text{Training error} + C\,h^{1/2}$
- $h$ = VC dimension, which is the maximum number of examples shattered by a function class $f(x, \alpha)$
30Q2 extension to allow errors
When noise exists: minimize $w \cdot w + C \cdot (\text{distance of error points to their correct place})$
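One common way to make "distance of error points to their correct place" precise is the hinge slack $\xi_i = \max(0,\ 1 - y_i (w \cdot x_i - b))$, which is zero for points on the correct side of the margin. The sketch below evaluates the objective $w \cdot w + C \sum_i \xi_i$ under that assumption; the slide does not spell out the slack definition, and the data and the name `soft_margin_objective` are hypothetical.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def soft_margin_objective(points, labels, w, b, C):
    """w . w + C * sum of hinge slacks (assumed reading of the slide)."""
    slacks = [max(0.0, 1.0 - y * (dot(w, x) - b))
              for x, y in zip(points, labels)]
    return dot(w, w) + C * sum(slacks)


# Hypothetical 1-D data; the last point is a noisy negative on the wrong side
points = [(2.0,), (-2.0,), (0.5,), (1.0,)]
labels = [1, -1, 1, -1]

value = soft_margin_objective(points, labels, w=(1.0,), b=0.0, C=10.0)
print(value)
```

A larger C penalizes training errors more heavily; a smaller C prefers a wider margin and tolerates more errors.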
31Q3 Non-linear transformation to Feature spaces
- General idea: introduce kernels
$\Phi: x \to \phi(x)$
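XOR is the usual illustration of why a feature map helps: no line separates the two classes in the input space, but after adding the product feature $x_1 x_2$ a plane does. The sketch below uses inputs in {-1, +1}; the particular map `phi` and the weights are illustrative assumptions, not taken from the slides.

```python
def phi(x):
    """Map a 2-D input to a 3-D feature space by adding the product term."""
    x1, x2 = x
    return (x1, x2, x1 * x2)


def f(z, w, b):
    """Linear classifier in feature space: sign(w . z - b)."""
    s = sum(wi * zi for wi, zi in zip(w, z)) - b
    return 1 if s >= 0 else -1


# XOR with -1/+1 encoding: the label is +1 exactly when the inputs differ
xor_data = [((-1, -1), -1), ((-1, 1), 1), ((1, -1), 1), ((1, 1), -1)]

w, b = (0.0, 0.0, -1.0), 0.0   # hand-picked separating plane in feature space
predictions = [f(phi(x), w, b) for x, _ in xor_data]
print(predictions)
```

A kernel computes the inner product $\phi(x) \cdot \phi(x')$ directly, so the mapped vectors never need to be built explicitly.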
32Q4 extension to other tasks?
$f(x, w, b) = w \cdot x - b$
- SVM for ranking
- Ranking SVM
- Idea
- For each ordered pair of instances $(x_1, x_2)$ where $x_1 < x_2$ in the ranking
- Generate a new instance
- $\langle x_1, x_2, +1 \rangle$
- Train an SVM on the new data set
- SVM for regression analysis
- SV regression
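The pair-generation step for ranking can be sketched as follows. Each item is a hypothetical (features, rank) tuple, and `ranking_pairs` is an illustrative name; the +1 label follows the instance format on the slide. Many Ranking SVM formulations then train on the difference vector of each pair, which is shown as a separate helper and is an assumption beyond what the slide states.

```python
from itertools import permutations


def ranking_pairs(ranked_items):
    """Return (x1, x2, +1) for every ordered pair with rank(x1) < rank(x2)."""
    return [(x1, x2, +1)
            for (x1, r1), (x2, r2) in permutations(ranked_items, 2)
            if r1 < r2]


def difference_instance(pair):
    """Assumed variant: represent a pair by the vector x2 - x1 and its label."""
    x1, x2, label = pair
    return tuple(b - a for a, b in zip(x1, x2)), label


# Hypothetical items: (feature vector, rank)
items = [((0.2, 1.0), 1), ((0.5, 0.4), 2), ((0.9, 0.1), 3)]

pairs = ranking_pairs(items)
print(len(pairs))
print(difference_instance(pairs[0]))
```

With n items of distinct rank this produces n(n-1)/2 training instances, which is one reason scaling matters for ranking tasks.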
33Q5 Scale Up?
- One of the initial drawbacks of SVM is its computational inefficiency.
- However, this problem is being solved with great success.
- SMO
- Breaks a large optimization problem into a series of smaller problems, each involving only a couple of carefully chosen variables
- The process iterates
- Core Vector Machines
- Find an approximate minimum enclosing ball of a set of instances.
- These instances, when mapped to an N-dimensional space, represent a core set
- Allow very fast solutions.
- Train a high-quality SVM on millions of data points in seconds
34Solution 4 The K-Means Clustering Method
- Given data on N customers
- Can you find out who is similar?
- Can you put all similar users in the same group automatically?
- K-means clustering
- Assumes that you wish to have K such groups
- Assumes that the data are given without the class labels
- Assumes that each dimension of the data has a concept of a mean value
35The mean point
The mean point can be a virtual point
36The K-Means Clustering Method
[Figure: scatter plots on axes from 0 to 10 illustrating K-means with K = 2]
- K = 2: arbitrarily choose K objects as the initial cluster centers
- Assign each object to the most similar center
- Update the cluster means
- Reassign
- Update the cluster means
- Reassign
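The assign/update loop in the figure can be sketched in a few lines. This is a minimal version under stated assumptions: similarity is squared Euclidean distance, the "arbitrary" initial centers are simply the first K points, and the loop stops when no assignment changes. The data and the name `kmeans` are hypothetical.

```python
def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iter=100):
    centers = [points[i] for i in range(k)]   # arbitrary choice: first k points
    assignment = None
    for _ in range(max_iter):
        # Assign each object to the most similar (closest) center
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points
        ]
        if new_assignment == assignment:      # nothing was reassigned: stop
            break
        assignment = new_assignment
        # Update each cluster mean (the mean may be a virtual point)
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = tuple(sum(col) / len(members)
                                   for col in zip(*members))
    return centers, assignment


points = [(1, 1), (1, 2), (9, 9), (10, 9), (2, 1), (9, 10)]
centers, assignment = kmeans(points, k=2)
print(centers, assignment)
```

A different choice of initial centers can lead to a different final clustering, because the procedure only converges to a local optimum.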
37Comments on the K-Means Method
- Strength: relatively efficient
- Comment: often terminates at a local optimum.
- Weaknesses
- Applicable only when the mean is defined; what about categorical data?
- Need to specify k, the number of clusters, in advance
- Unable to handle noisy data and outliers too well
- Not suitable for discovering clusters with non-convex shapes
What about these?
38Solution 5 Data Mining for Recommendation
- Which movie would Sammy watch next?
- Ratings 1--5
39Statistical Collaborative Filters
- Users annotate items with numeric ratings.
- Users who rate items similarly become mutual advisors.
- A recommendation is computed by taking a weighted aggregate of advisor ratings.
40Basic Idea
- Nearest neighbor algorithm
- Given a user a and item i
- First, find the most similar users to a,
- Let these be Y
- Second, find how these users (Y) rated i,
- Then, calculate a predicted rating of a on i based on some average of all these users Y
- How to calculate the similarity and average?
41Statistical Filters
- GroupLens [Resnick et al. 94, MIT]
- Filters UseNet News postings
- Similarity: Pearson correlation
- Prediction: weighted deviation from mean
42Pearson Correlation
43Pearson Correlation
- Weight between users a and u
- Compute the similarity matrix between users
- Use Pearson correlation (values range from -1 through 0 to 1)
- Let the items be all items that both users rated
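The two ingredients named above, Pearson correlation as the weight between users and a weighted deviation from the mean as the prediction, can be sketched as below. Several details are assumptions rather than something the slides specify: the correlation is computed over co-rated items only, each user's mean is taken over all items that user rated, and every other user who rated the item acts as an advisor (no top-k cut). The ratings are hypothetical.

```python
from math import sqrt


def pearson(ra, ru):
    """Pearson correlation between two users over their co-rated items."""
    common = [i for i in ra if i in ru]
    if len(common) < 2:
        return 0.0
    mean_a = sum(ra[i] for i in common) / len(common)
    mean_u = sum(ru[i] for i in common) / len(common)
    num = sum((ra[i] - mean_a) * (ru[i] - mean_u) for i in common)
    den = sqrt(sum((ra[i] - mean_a) ** 2 for i in common)
               * sum((ru[i] - mean_u) ** 2 for i in common))
    return num / den if den else 0.0


def predict(ratings, a, item):
    """Mean rating of a plus the weighted deviation of advisors from their means."""
    ra = ratings[a]
    mean_a = sum(ra.values()) / len(ra)
    num = den = 0.0
    for u, ru in ratings.items():
        if u == a or item not in ru:
            continue
        w = pearson(ra, ru)
        mean_u = sum(ru.values()) / len(ru)
        num += w * (ru[item] - mean_u)
        den += abs(w)
    return mean_a if den == 0 else mean_a + num / den


# Hypothetical ratings on a 1 to 5 scale
ratings = {
    "Sammy": {"A": 5, "B": 3, "C": 4},
    "Ann": {"A": 5, "B": 3, "C": 4, "D": 5},
    "Bob": {"A": 1, "B": 5, "C": 2, "D": 1},
}

predicted = predict(ratings, "Sammy", "D")
print(pearson(ratings["Sammy"], ratings["Ann"]), predicted)
```

A negatively correlated user still contributes: Bob disliking movie D pushes the prediction for Sammy upward.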
44Part IV The reality
45DM Today in Financial Industry
[Figure: customer events over time, with the labels Appl, Credit, Query, Usage, Complain, Leave, Promotion, Entry, Leave, Management, Compete, Home Buying, Work, Married, Birth of Child, Retire, Promoted]
46The Process of data mining
Relational Database → Select → Preprocess → Transform → Mine → Evaluate → Knowledge
47The practice today: an example (Ling and Li, KDD98)
- The mailing cost is reduced
- But the response rate is improved.
- The net profit is increased dramatically.
48Specific problems and solutions
- Extremely imbalanced class distribution
- E.g. only 1% are positive (buyers), and the rest are negative (non-buyers).
- Evaluation criterion for the data mining process
- The predictive accuracy is no longer suitable.
- The training set with a large number of variables can be too large.
- An efficient learning algorithm is required.
- Rank training and testing examples
- We require learning algorithms to produce a probability estimate or confidence factor.
- Use lift as the evaluation criterion
- A lift reflects the redistribution of responders in the testing set after ranking the testing examples.
49Solution lift index for evaluation
- A typical lift table
- Use a weighted sum of the items in the lift table over the total sum: the lift index
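A weighted sum over the lift table can be sketched as follows. The sketch assumes the table lists the number of responders in each of ten deciles of the ranked test set, best-ranked decile first, and that the weights fall linearly from 1.0 to 0.1; the weights and the example counts are assumptions, since the slide's own table is not reproduced here.

```python
def lift_index(responders_per_decile):
    """Weighted sum of the lift table entries divided by the total sum."""
    n = len(responders_per_decile)
    total = sum(responders_per_decile)
    weights = [(n - i) / n for i in range(n)]   # 1.0, 0.9, ..., 0.1 for n = 10
    weighted = sum(w * s for w, s in zip(weights, responders_per_decile))
    return weighted / total


# Hypothetical lift table: responders found in each decile after ranking
table = [40, 20, 12, 8, 6, 5, 4, 3, 1, 1]

print(lift_index(table))
print(lift_index([10] * 10))          # responders spread evenly
print(lift_index([100] + [0] * 9))    # all responders in the top decile
```

With these weights an uninformative ranking scores about 0.55 and a ranking that puts every responder in the top decile scores 1.0, so the index rewards pushing buyers toward the top of the list.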
50MTMI 521 @ HKUST Data Mining Knowledge Management
How to successfully find a date
- Start
- Introduction
- Objectives
- Methodology Process
- Primary Results
- Elementary Analysis
- Associate
- Cluster
- Classify
- Key Learning Points
- Finish Next Steps
Team Members
- James Chung
- Nelson To
- Joyce Ngo
- Michael Lee
- Alesandro Sicheri
The Dating and Mating Team 27 November, 2004
51Objectives
- Improve the user experience
- Our objective is to utilise traditional data mining techniques (associate, cluster, and classify) to answer the following types of very important dating questions:
- Predict my ideal dating partner. Will he/she still be with me in the morning?
- Am I attractive to the opposite sex?
- Identify the type of person that has a tendency to be unfaithful, and vice versa.
- Which group has a tendency to have long-term relationships?
52Methodology Process (Survey Hardware Topology)
10 minutes from download to Weka
53Methodology Process
54Primary Results
Overview
- Data set (28 Nov 2004) used for the report: 85 records and 122 attributes.
- 48 men and 37 women replied
Issues
- Disclaimer: we may not have enough data to discover answers for all our objectives; however, we aim to discover the way to find answers. So, if you have not completed the survey, then please do.
55Elementary Analysis
Results
- Analysis of statistics produced by Access and Weka led to the following elementary findings:
- 50% have a tendency to accept one-night stands
- 33% accept dating two people at the same time
- Most people like to seek an intelligent partner, but 30% of people prefer a not-so-intelligent partner
- Most people want a long-term relationship (70%)
- Most people are not interested in Intellectual Pursuit, Politics, Collecting, Gardening, or Dancing
- The most favored partner occupation is professional
- The vast majority of people want their partner to be slightly eccentric (82%)
- The most desirable partner attributes are sincere, patient, humorous, creative (not dull), organized
- People's favorite book type is Literature and Fiction
- People's favorite film type is Romantic Comedies
- The most popular ideal partner sign is Scorpio
- Most people consider themselves creative and organized (slight contradiction)
Observation: Does this analysis reflect Hong Kong culture? Population type has an influence on the overall characteristics of the dataset. Would we get the same results asking a group of Australians? Data miners should be aware of factors that may change the characteristics of a model if it is applied to heterogeneous datasets.
Issues
56Associate Analysis
Analysis
- Initial aim: try to determine the association between personal characteristics and partner characteristics. (Algorithm: Apriori)
- Patterns of limited interest were generated because most people acted humbly in defining their characteristics.
- 65. Thoughtful_to_Indifferent=2 Organized_to_Spontaneous=3 18 ==> Partner_Selfish_to_Respective=4 17 conf:(0.94)
- 398. Selfish_to_Respective=4 Creative_to_Dull=2 21 ==> Partner_Computer_and_Technology=no 17 conf:(0.81)
- This pattern tells us that most people who are respectful and creative do not want a partner whose hobby is computers and technology
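The confidence values in the two rules above follow directly from the counts printed with them: the number of records matching the whole rule divided by the number matching the left-hand side. A minimal check, with `confidence` as a hypothetical helper name:

```python
def confidence(antecedent_count, rule_count):
    """conf(A ==> B) = count(A and B) / count(A)."""
    return rule_count / antecedent_count


rule_65 = confidence(antecedent_count=18, rule_count=17)
rule_398 = confidence(antecedent_count=21, rule_count=17)

print(round(rule_65, 2))   # 0.94
print(round(rule_398, 2))  # 0.81
```

With only 85 records, a rule supported by 17 of them covers one fifth of the data, which is worth keeping in mind when judging how far such a pattern generalizes.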
57Associate Analysis
- Evolution: personal age, sex, height, body type, eye color, hair length, education, occupation, smoking, drinking, religion, married status, want children, first language and second language attributes
- 35. Smoking=non Partner_Creative_to_Dull=2 40 ==> Second_Language=english 37 conf:(0.93)
- 4428. Smoking=non Religion=none First_Language=cantonese Second_Language=english Partner_Pessimistic_to_Enthusiastic=4 Partner_Conservative_to_Eccentric=3 13 ==> Married_Status=single 13 conf:(1)
- Associations may not align with domain knowledge (we are still working in this area)
- The patterns discovered depend on the attributes you select (domain knowledge)
Issues
58Cluster Analysis
Analysis
- See if there is any interesting pattern between horoscope and double dating. The clustered result did not show any specific patterns after 3 iterations, and cluster size ranged from 2 to 10.
The only star sign to be clustered was Virgo
- Observation: Virgo being the most faithful?
59Cluster Analysis
Analysis
- Using a classification algorithm, higher relevance was found between Sensitive_to_Indifferent and Double_Date. Therefore, the next step is to use clustering to find out whether there is any interesting pattern.
SimpleKMeans -N 3 -S 10
kMeans
Number of iterations: 3
Within cluster sum of squared errors: 32.71890394088671
Cluster centroids:
Cluster 0  Mean/Mode: male 2.3929 no  Std Devs: N/A 0.8751 N/A
Cluster 1  Mean/Mode: female 2.0345 no  Std Devs: N/A 0.7784 N/A
Cluster 2  Mean/Mode: male 2.8214 yes  Std Devs: N/A 0.7724 N/A
Clustered Instances
0  28 (33%)
1  29 (34%)
2  28 (33%)
- Observation: men who are indifferent double date
60Cluster Analysis
Issues
- Clustering for clarity?
- There might be no meaningful cluster at all!
- Is there any correct way to cluster?
- There is no best way for clustering.
- In this example, application of a different data mining technique answered the original question: keep an open mind and try different things.
- Pain areas
- How to determine the number of clusters?
- Attribute significance cannot be determined.
- Lacks explanation capabilities: use domain knowledge and imagination!
61Classify (Case Study)
Analysis
62Classify
Our aim is to predict an individual's ideal partner (an example)
Oops: as you can see, this is work in progress
63Key Learning Points
Issues
- Domain knowledge is king! You cannot do DM without it.
- What came first, the chicken or the egg? Do you set your objectives first, or do you use data mining to identify them?
- Garbage in, garbage out: data capture and preparation is a critical part of the data mining process; to ignore it is to invite disaster. Discovering the appropriate attributes will make or break a data mining exercise.
- Quality versus quantity: manage the trade-off between creating a data-rich source with less participation and a data-poor source with larger participation.
- When do we need to retrain the model? What factors influence this decision?
- Trust, but audit! Walk through and check consistency at every stage of data mining (ideally involving all members of the design team)
64Key Learning Points
Issues
- What is an interesting pattern? How you identify them can be subjective and based on domain knowledge. We are still learning about what to do with our data.
- Never trust the data entered: always check it and align it with the real world.
- How do you manage a data mining project? That is the million-dollar question. Although we are still working on it, we are improving with every moment.
- Need to walk before running: build up individual and team experience before tackling difficult data mining tasks. Our team was completely confused after the first week; it can be argued we still are!
- Finally, data mining is not easy! It is an art, not a science, and therefore alien to engineers. But it is fun and exciting when you discover something new!
65Current Research at HKUST: http://www.cs.ust.hk/qyang
- Wireless data mining
- How to detect user behaviors in a wireless environment?
- We are the inventors of wireless user-activity recognition software
- Web data mining
- How to classify text documents?
- We are the winner of the 2005 ACM KDDCUP world web search competition
66Conclusions
- Data Mining is a process, not a tool
- Data mining can be used in financial industry to
- Analyze customer profiles
- Rank customers on credit and risk
- Structure marketing campaigns
- Plans for action