Data Mining and Its Applications presentation

1
Data Mining and Its Applications

Qiang Yang
Hong Kong University of Science and Technology
qyang_at_cse.ust.hk
http://www.cse.ust.hk/qyang
Thanks: Jiawei Han, Sunny Chee, Jian Pei, and many others

2
Outline
Part

The problem
The (ideal) formulation
The (ideal) solution
The reality
The practice today
Looking into the future

3
Part I: The problem: Example 1

You are a marketing manager for a brokerage company
Problem: Churn (also known as attrition) is too high
Turnover is 40% annually
You have some potential incentives, but giving new incentives to everyone who might leave is very expensive (as well as wasteful)
Bringing back a customer after they leave is both difficult and costly!

What to do?
The data mining method:
Collect lots of customer data
Predict which customers are likely to leave

4
The problem: Example 2

You are running a Web search engine
You make money by selling advertisements
Each click on an ad is worth 1.00
How do you match query terms to ads?

Try the Google query "bank"

5
The problem: Example 3

You are running a brokerage firm
You wish to find the correlations between economic factors and stock prices
How do you predict tomorrow's stock market?

We know quite a bit about the economic factors
But we don't know how they combine to take effect
Can we ask computers to work that out for us?

6
The problem: Example 4

We wish to design a system that recognizes whether a user of our credit card is a thief
How do we recognize this from the purchases?

We wish to find out whether a person who logs in to our computer system is a hacker
How do we recognize this from the sequence of commands?

7
The problem: Example 5

You are running an online shop such as Amazon.com
Your user has already bought some books
Which other books do you recommend to your user?
Can you base the recommendation on other people's purchases?

This requires that you know which users are alike
If two users are similar to each other, then their purchases are also likely to be similar

8
The problem: Example 6

You have an elderly parent at home, alone
You wish to watch over the parent using sensors
If something seems wrong, you wish to be notified and to call the doctor right away
How do you tell that something is out of the ordinary and sound the alarm in time?

9
Part II: The (ideal) formulation
[Figure legend: one marker denotes +1, the other denotes -1]

I can go on and on about problems, but what do they have in common?
Given a set of objects with known descriptive variables and known classes:
$(X_i, y_i),\ i = 1, 2, \dots, t$
$y_i = +1$ or $-1$
Problem 1:
Find a discriminative function $f(x)$ such that $f(X_i) = y_i$
Supervised learning
Problem 2:
Find clusters of similar data $X_i$
Unsupervised learning

10
DM definition: a predictive model is akin to

A black box that makes predictions about the future based on information from the past and present
A large number of inputs is usually available

11
Convergence of Three Technologies

Machine learning and data mining algorithms are improving at a dramatic pace
We can process millions of records in seconds on a typical PC

12
How are Models Built and Used?

View from 20,000 feet

13
Predictive Models are

Decision Trees
Nearest Neighbor
Neural Networks
Rule Induction
Clustering

Marketing
Direct mail marketing
Web site personalization
Fraud Detection
Credit card fraud detection
Science
Bioinformatics
Gene analysis
Web Text analysis
Google

14
Part III Ideal Solutions

Decision Trees
Naïve Bayesian
Support Vector Machines
K-means Clustering
Correlation Analysis

First, decision trees
Have you played the 20 question game?

15
Solution 1 Decision trees for Risk Analysis
(thanks D.B. Chan, S.H. Univ. )

Cases

16
Classification (thanks D.B. Chan, S.H. Univ. )

Training Phase

Step 1
Decision tree: split on Income, which partitions the data into D1 and D2
  Income = High -> subset D1
  Income = Low -> subset D2
17
Classification Analysis (thanks D.B. Chan, S.H.
Univ. )

Training Phase

Steps 1 and 2
Decision tree: split on Income, then split D1 on Married into D1a and D1b
  Income = High -> subset D1
    Married = No -> Poor
    Married = Yes -> Good
  Income = Low -> subset D2 -> Poor
18
Classification Prediction

Prediction phase: a new data item arrives

Name = Susan, Debt = Low, Income = High, Married = Yes: what is the Risk factor?
Decision tree
  Income = High -> subset D1
    Married = Yes -> Good
    Married = No -> Poor
  Income = Low -> subset D2
So the Risk factor for this customer is Good!
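
The prediction phase can be sketched in a few lines of Python. This is a minimal sketch that hard-codes the two splits from the slides (Income, then Married). The function name `predict_risk`, the dictionary keys, and the single "Poor" leaf for low income are assumptions made for illustration; they are not part of the original deck.

```python
def predict_risk(customer):
    """Walk the two-level tree: test Income first, then Married."""
    if customer["Income"] == "High":
        # Subset D1 is split again on marital status.
        if customer["Married"] == "Yes":
            return "Good"
        return "Poor"
    # Subset D2 (Income = Low); assumed here to be a single "Poor" leaf.
    return "Poor"


susan = {"Name": "Susan", "Debt": "Low", "Income": "High", "Married": "Yes"}
print(predict_risk(susan))  # Good
```

Debt is carried in the record but never tested, which matches the tree shown: only attributes chosen during training appear on the path.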
19
Classification tree: can be used for classifying the type of customer
[Figure: a multi-level classification tree. The node labels did not survive in the source. What remains: each node shows a two-way percentage split (12% / 88% at the root), the numeric split thresholds are 45, 23, 25,000 and 60,000, and the leaves are labelled good or bad.]
20
Solution 2 Bayesian Methods
21
Naïve Bayes Method

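Assume that the attributes are independent given the class (play)

Class: play
Attributes: outlook, temp, humidity, windy
$\Pr(\text{outlook} = \text{sunny} \mid \text{windy} = \text{true}, \text{play} = \text{yes}) = \Pr(\text{outlook} = \text{sunny} \mid \text{play} = \text{yes})$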
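
Under this independence assumption, the score of a class is its prior multiplied by one conditional probability per attribute. The sketch below shows that computation on a small hypothetical table with only two attributes (outlook and windy). The rows are made up for illustration and are NOT the weather table from the slides; the function names are also assumptions, and no smoothing is applied.

```python
from collections import Counter

# Hypothetical toy rows (outlook, windy, play); not the table from the slides.
rows = [
    ("sunny", "false", "no"),
    ("sunny", "true", "no"),
    ("overcast", "false", "yes"),
    ("rainy", "false", "yes"),
    ("rainy", "true", "no"),
    ("sunny", "false", "yes"),
    ("overcast", "true", "yes"),
    ("rainy", "false", "yes"),
]


def train(rows):
    """Count classes and (attribute index, value, class) combinations."""
    class_counts = Counter(row[-1] for row in rows)
    value_counts = Counter()
    for *attrs, label in rows:
        for i, value in enumerate(attrs):
            value_counts[(i, value, label)] += 1
    return class_counts, value_counts


def posterior(attrs, class_counts, value_counts):
    """Pr(class | attrs), treating attributes as independent given the class."""
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        score = n / total  # prior Pr(play = label)
        for i, value in enumerate(attrs):
            score *= value_counts[(i, value, label)] / n  # Pr(attr = value | label)
        scores[label] = score
    norm = sum(scores.values())  # no smoothing: assumes some class has a nonzero score
    return {label: s / norm for label, s in scores.items()}


class_counts, value_counts = train(rows)
result = posterior(("sunny", "true"), class_counts, value_counts)
print(max(result, key=result.get))  # no
```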
22
Probabilities weather data

A new day

23
Naïve Bayesian is very accurate, why? (Domingos,
1997)
24
Solution3 Support Vector Machines

Hard-Margin Linear Classifier
Maximize Margin
Support Vectors
Quadratic Programming
Soft-Margin Linear Classifier
Non-Linear Separable Problem and Kernels
XOR
Extension to
Regression for numerical class
Ranking rather than classification
SMO and Core vector machines

Problem
Given a set of objects with known descriptive variables and known classes:
$(X_i, y_i),\ i = 1, 2, \dots, t$
Find a discriminative function $f(x)$ such that $f(X_i) = y_i$.
SVM today
A must-try for most applications
Mathematically well founded
Robust to noise (non-support vectors are ignored)
Works even with only dozens of training examples
Among the most accurate algorithms
Has many extensions
Can be scaled up (ongoing work)

25
Linear Classifiers
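[Diagram: input x -> f (parameter a) -> estimated output y_est]
$f(x, w, b) = \operatorname{sign}(w \cdot x - b)$
[Figure legend: one marker denotes +1, the other denotes -1]
How would you classify this data?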
26
Linear Classifiers
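[Diagram: input x -> f (parameter a) -> estimated output y_est]
$f(x, w, b) = \operatorname{sign}(w \cdot x - b)$
[Figure legend: one marker denotes +1, the other denotes -1]
Any of these would be fine, but which is best?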
27
Maximum Margin
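[Diagram: input x -> f (parameter a) -> estimated output y_est]
$f(x, w, b) = \operatorname{sign}(w \cdot x - b)$
[Figure legend: one marker denotes +1, the other denotes -1]
The maximum margin linear classifier is the linear classifier with the maximum margin.
Support vectors are those data points that the margin pushes up against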
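
To make "which is best?" concrete, the sketch below evaluates the classifier sign(w . x - b) and compares the margin of two candidate separating lines on a handful of points. The data points, both candidate (w, b) pairs, and the function names are hypothetical and chosen only for illustration. No optimization is performed here; a real SVM would search for the maximum-margin line rather than compare two guesses.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def linear_classify(x, w, b):
    """f(x, w, b) = sign(w . x - b); a score of exactly zero is mapped to +1."""
    return 1 if dot(w, x) - b >= 0 else -1


def margin(points, labels, w, b):
    """Smallest signed distance from any labelled point to the line w . x = b."""
    norm = dot(w, w) ** 0.5
    return min(y * (dot(w, x) - b) / norm for x, y in zip(points, labels))


# Hypothetical, linearly separable data.
points = [(2, 2), (3, 3), (0, 0), (-1, 0)]
labels = [1, 1, -1, -1]

margin_a = margin(points, labels, w=(1, 1), b=2)    # diagonal separator
margin_b = margin(points, labels, w=(1, 0), b=1.5)  # vertical separator
print(round(margin_a, 3), round(margin_b, 3))  # 1.414 0.5
```

Both lines classify every point correctly, but the diagonal one keeps the closest points farther away, so it is the better choice under the maximum-margin criterion.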
28
SVM Questions

Can we understand the meaning of the SVM through
a solid theoretical foundation?
Can we extend the SVM formulation to handle cases
where we allow errors to exist, when even the
best hyperplane must admit some errors on the
training data?
Can we extend the SVM formulation so that it
works in situations where the training data are
not linearly separable?
Can we extend the SVM formulation so that the
task is to rank the instances in the likelihood
of being a positive class member, rather than
classification?
Can we scale up the algorithm for finding the
maximum margin hyperplanes to thousands and
millions of instances?

29
Q1. Support Vector Machines Foundations

The problem of finding the maximum margin can be transformed into finding the roots of a Lagrangian
It can be solved using quadratic programming (QP)
It has solid theoretical foundations
Future error < Training error + $C h^{1/2}$
$h$ = VC dimension, which is the maximum number of examples shattered by a function class $f(x, a)$

30
Q2: Extension to allow errors
When noise exists: minimize $w \cdot w + C \times$ (total distance of the error points to their correct place)
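
One common way to make "distance of error points to their correct place" precise is a slack value per point, max(0, 1 - y (w . x - b)), which is zero for points on the correct side of the margin. The sketch below evaluates the objective from the slide under that assumption. The slack definition, the data, and the function name are assumptions, not taken from the deck, and many textbooks scale the first term by 1/2.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def soft_margin_objective(points, labels, w, b, C):
    """Return w . w + C * (sum of slacks) for a given line (w, b)."""
    slacks = [max(0.0, 1.0 - y * (dot(w, x) - b)) for x, y in zip(points, labels)]
    return dot(w, w) + C * sum(slacks)


# Hypothetical data: the third point is a negative example on the wrong side.
points = [(2, 2), (0, 0), (1.5, 1.5)]
labels = [1, -1, -1]

print(soft_margin_objective(points, labels, w=(1, 1), b=2, C=1))   # 4.0
print(soft_margin_objective(points, labels, w=(1, 1), b=2, C=10))  # 22.0
```

A larger C makes the same misplaced point cost more, which pushes the optimizer toward fitting the training data more tightly.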
31
Q3: Non-linear transformation to feature spaces

General idea: introduce kernels

$\Phi : x \to \phi(x)$
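
The XOR problem mentioned earlier shows why such a mapping helps: no straight line separates XOR in the original two dimensions, but a plane does after adding one product feature. The sketch below uses the map phi(x) = (x1, x2, x1 * x2). That particular map, the -1/+1 coding, and the weights are illustrative assumptions rather than content from the slides.

```python
def phi(x):
    """Assumed feature map: add the product x1 * x2 as a third coordinate."""
    x1, x2 = x
    return (x1, x2, x1 * x2)


def linear_classify(z, w, b):
    """sign(w . z - b), returning +1 when the score is exactly zero."""
    score = sum(wi * zi for wi, zi in zip(w, z)) - b
    return 1 if score >= 0 else -1


# XOR with inputs and labels coded as -1 / +1: the label is +1 when the inputs differ.
xor_points = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
xor_labels = [-1, 1, 1, -1]

# In the feature space one hyperplane separates the classes: w = (0, 0, -1), b = 0.
w, b = (0, 0, -1), 0
predictions = [linear_classify(phi(x), w, b) for x in xor_points]
print(predictions == xor_labels)  # True
```

A kernel computes the inner product phi(x) . phi(z) directly, so the mapped vectors never have to be built explicitly; the explicit map is used here only to keep the example readable.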
32
Q4: Extension to other tasks?
$f(x, w, b) = w \cdot x - b$

SVM for ranking
Ranking SVM
Idea:
For each ordered pair of instances (x1, x2) where x1 < x2 in the ranking
Generate a new instance <x1, x2, +1>
Train an SVM on the new data set
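
The pair-generation step can be sketched as follows. The item list, the convention that a larger rank value means "ranked higher", and the function name are assumptions for illustration. A common variant trains on the difference vector x2 - x1 instead of the raw pair; that step and the SVM training itself are not shown.

```python
from itertools import permutations


def make_pairwise_instances(items):
    """items: list of (feature_vector, rank); larger rank means ranked higher.

    For each ordered pair (x1, x2) with x1 ranked below x2, emit (x1, x2, +1).
    """
    instances = []
    for (x1, r1), (x2, r2) in permutations(items, 2):
        if r1 < r2:
            instances.append((x1, x2, +1))
    return instances


# Hypothetical items with two features each and ranks 1 (lowest) to 3 (highest).
items = [((1.0, 0.0), 1), ((0.5, 0.5), 2), ((0.0, 1.0), 3)]
pairs = make_pairwise_instances(items)
print(len(pairs))  # 3
```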

SVM for regression analysis
SV regression

33
Q5 Scale Up?

One of the initial drawbacks of SVM is its
computational inefficiency.
However, this problem is being solved with great
success.
SMO
Breaks a large optimization problem into a series of smaller problems, each of which involves only a couple of carefully chosen variables
The process iterates
Core Vector Machines
Find an approximate minimum enclosing ball of a set of instances
These instances, when mapped to an N-dimensional space, represent a core set
This allows a very fast solution
Train a high-quality SVM on millions of examples in seconds

34
Solution 4: The K-means clustering method

Given data on N customers
Can you find out which customers are similar?
Can you put all similar customers in the same group automatically?

K-means clustering
Assumes that you wish to have K such groups
Assumes that the data are given without class labels
Assumes that each dimension of the data has a meaningful mean value
35
The mean point
The mean point can be a virtual point
36
The K-Means Clustering Method

Example

[Figure: K-means on a scatter plot with both axes running from 0 to 10, K = 2]
Arbitrarily choose K objects as the initial cluster centers
Assign each object to the most similar center
Update the cluster means
Reassign the objects, update the cluster means, and repeat
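
The loop in this example can be sketched in plain Python. This is a minimal sketch, assuming squared Euclidean distance as the similarity measure and taking the first K objects as the "arbitrary" initial centers so that the run is repeatable. The data points and the function name are hypothetical.

```python
def kmeans(points, k, max_iterations=100):
    # Arbitrary initial choice: use the first K objects as cluster centers.
    centers = list(points[:k])
    assignment = None
    for _ in range(max_iterations):
        # Assign each object to the most similar (closest) center.
        new_assignment = [
            min(range(k), key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centers[c])))
            for point in points
        ]
        if new_assignment == assignment:
            break  # no object changed cluster, so the means will not move
        assignment = new_assignment
        # Update each cluster mean; the mean can be a "virtual" point.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centers, assignment


# Hypothetical data: two well-separated groups of three points.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centers, assignment = kmeans(points, k=2)
print(assignment)  # [0, 0, 0, 1, 1, 1]
```

Both initial centers start inside the same group here, yet the reassign-and-update loop still recovers the two groups. With less clearly separated data the result can depend on the initial choice.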
37
Comments on the K-Means Method

Strength: relatively efficient
Comment: often terminates at a local optimum
Weaknesses
Applicable only when a mean is defined; what about categorical data?
Need to specify k, the number of clusters, in advance
Handles noisy data and outliers poorly
Not suitable for discovering clusters with non-convex shapes

What about these?
38
Solution 5 Data Mining for Recommendation

Which movie would Sammy watch next?
Ratings 1--5

39
Statistical Collaborative Filters

Users annotate items with numeric ratings.
Users who rate items similarly become mutual
advisors.
Recommendation computed by taking a weighted
aggregate of advisor ratings.

40
Basic Idea

Nearest neighbor algorithm
Given a user a and an item i
First, find the users most similar to a; let these be Y
Second, find how these users (Y) rated i
Then, calculate a predicted rating of a on i based on some average over all these users Y
How do we calculate the similarity and the average?

41
Statistical Filters

GroupLens (Resnick et al., 1994, MIT)
Filters Usenet news postings
Similarity: Pearson correlation
Prediction: weighted deviation from the mean

42
Pearson Correlation
43
Pearson Correlation

Weight between users a and u
Compute a similarity matrix between users
Use the Pearson correlation (ranges from -1 through 0 to 1)
Let the items be all items that the users rated
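
The two ingredients, Pearson similarity and a prediction built from weighted deviations from each advisor's mean, can be sketched as below. The ratings table is hypothetical (only the name Sammy comes from the slides). Computing the correlation over co-rated items and normalizing by the sum of absolute weights are assumptions about the details, which the slides do not spell out.

```python
from math import sqrt

# Hypothetical ratings on a 1-5 scale: ratings[user][item].
ratings = {
    "Sammy": {"A": 5, "B": 3, "C": 4},
    "Dylan": {"A": 4, "B": 2, "C": 3, "D": 4},
    "Mathew": {"A": 1, "B": 5, "C": 2, "D": 2},
}


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def pearson(a, u):
    """Pearson correlation between users a and u over the items both rated."""
    common = set(ratings[a]) & set(ratings[u])
    if len(common) < 2:
        return 0.0
    mean_a = mean(ratings[a][i] for i in common)
    mean_u = mean(ratings[u][i] for i in common)
    num = sum((ratings[a][i] - mean_a) * (ratings[u][i] - mean_u) for i in common)
    den = sqrt(sum((ratings[a][i] - mean_a) ** 2 for i in common)) * sqrt(
        sum((ratings[u][i] - mean_u) ** 2 for i in common)
    )
    return num / den if den else 0.0


def predict(a, item):
    """User a's mean plus the weighted deviation of each advisor from their own mean."""
    mean_a = mean(ratings[a].values())
    num = den = 0.0
    for u in ratings:
        if u == a or item not in ratings[u]:
            continue
        w = pearson(a, u)
        num += w * (ratings[u][item] - mean(ratings[u].values()))
        den += abs(w)
    return mean_a + num / den if den else mean_a


print(round(pearson("Sammy", "Dylan"), 3))  # 1.0
print(pearson("Sammy", "Mathew") < 0)       # True
```

A negatively correlated user still contributes: Mathew rating item D below his own mean pushes the prediction for Sammy upward.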

44
Part IV The reality
45
DM Today in Financial Industry
[Figure: diagram with the labels Appl, Credit, Query, Usage, Complain, Leave, Promotion, Entry, Leave, Management, Compete, Home Buying, Work, Married, Birth of Child, Retire, Promoted]
46
The Process of data mining
[Figure: the process runs Relational Database -> Select -> Preprocess -> Transform -> Mine -> Evaluate -> Knowledge]
47
The practice today: an example (Ling and Li, KDD'98)

The mailing cost is reduced
Yet the response rate is improved
The net profit is increased dramatically

48
Specific problems and solutions

Extremely imbalanced class distribution
E.g. only 1% are positive (buyers), and the rest are negative (non-buyers).
Evaluation criterion for the data mining process
Predictive accuracy is no longer suitable.
A training set with a large number of variables can be too large.
An efficient learning algorithm is required.

Rank the training and testing examples
We require learning algorithms to produce a probability estimate or confidence factor.
Use lift as the evaluation criterion
Lift reflects the redistribution of responders in the testing set after the testing examples are ranked.

49
Solution: lift index for evaluation

A typical lift table
Use a weighted sum of the items in the lift table over the total sum: the lift index
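
A weighted-sum index of this kind can be sketched as follows. The lift table is taken to hold the number of responders in each decile of the ranked test set, top decile first. The linearly decreasing weights 1.0, 0.9, ..., 0.1 and the example tables are assumptions for illustration, since the slide does not list the weights.

```python
def lift_index(responders_per_decile):
    """Weighted share of responders, with earlier deciles weighted more heavily."""
    n = len(responders_per_decile)
    weights = [(n - i) / n for i in range(n)]  # assumed weights: 1.0, 0.9, ..., 0.1
    total = sum(responders_per_decile)
    return sum(w * s for w, s in zip(weights, responders_per_decile)) / total


# Hypothetical lift tables.
random_ranking = [10] * 10          # responders spread evenly over the deciles
perfect_ranking = [100] + [0] * 9   # every responder lands in the top decile

print(round(lift_index(random_ranking), 2))  # 0.55
print(lift_index(perfect_ranking))           # 1.0
```

Under these weights a ranking no better than chance scores about 0.55 and a perfect one scores 1.0, so the index rewards models that push responders toward the top of the list without depending on raw accuracy.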

50
MTMI 521 @ HKUST: Data Mining and Knowledge Management
How to successfully find a date

Start
Introduction
Objectives
Methodology Process
Primary Results
Elementary Analysis
Associate
Cluster
Classify
Key Learning Points
Finish Next Steps

Team Members

James Chung
Nelson To
Joyce Ngo
Michael Lee
Alesandro Sicheri

The Dating and Mating Team 27 November, 2004
51
Objectives

Improve the user experience
Our objective is to use traditional data mining techniques (associate, cluster, and classify) to answer the following very important dating questions:
Can we predict my ideal dating partner? Will he/she still be with me in the morning?
Am I attractive to the opposite sex?
Which type of person tends to be unfaithful, and which does not?
Which group tends to have long-term relationships?

52
Methodology Process (Survey Hardware Topology)

10 minutes from download to Weka
53
Methodology Process

54
Primary Results
Overview

Data set (28 Nov 2004) used for the report: 85 records and 122 attributes.
48 men and 37 women replied

Issues

Disclaimer: we may not have enough data to discover answers for all our objectives; however, we aim to discover the way to find answers. So, if you have not completed the survey, then please do.

55
Elementary Analysis
Results

Analysis of statistics produced by Access and Weka led to the following elementary findings:
50% tend to accept a one-night stand
33% accept dating two people at the same time
Most people seek an intelligent partner, but 30% prefer a not-so-intelligent partner
Most people want a long-term relationship (70%)
Most people are not interested in intellectual pursuits, politics, collecting, gardening, or dancing
The favorite partner occupation is professional
The vast majority of people want their partner to be slightly eccentric (82%)
The most desirable partner attributes are sincere, patient, humorous, creative (not dull), and organized
People's favorite book category is Literature and Fiction
People's favorite film type is Romantic Comedies
The most popular ideal partner sign is Scorpio
Most people consider themselves creative and organized (a slight contradiction)

Observation: Does this analysis reflect Hong Kong culture? The type of population influences the overall characteristics of the dataset. Would we get the same results from a group of Australians? Data miners should be aware of factors that may change the characteristics of a model when it is applied to heterogeneous datasets.
Issues
56
Associate Analysis
Analysis

Initial aim: try to determine the association between personal characteristics and partner characteristics (algorithm: Apriori).
Patterns of limited interest were generated because most people described their own characteristics humbly.
65. Thoughtful_to_Indifferent=2 Organized_to_Spontaneous=3 18 ==> Partner_Selfish_to_Respective=4 17 conf:(0.94)
398. Selfish_to_Respective=4 Creative_to_Dull=2 21 ==> Partner_Computer_and_Technology=no 17 conf:(0.81)
The second pattern tells us that most people who are respectful and creative do not want a partner whose hobby is computers and technology
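
The confidence values can be checked from the two counts printed with each rule: the number of records matching the left-hand side, and the number matching both sides. The short sketch below redoes that arithmetic for the two rules quoted on this slide, using the 85 survey records reported earlier for the support figure. The function names are illustrative.

```python
def confidence(antecedent_count, rule_count):
    """conf(A ==> B) = count(A and B) / count(A)."""
    return rule_count / antecedent_count


def support(rule_count, total_records):
    """Fraction of all records that match both sides of the rule."""
    return rule_count / total_records


print(round(confidence(18, 17), 2))  # 0.94  (rule 65)
print(round(confidence(21, 17), 2))  # 0.81  (rule 398)
print(support(17, 85))               # 0.2
```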

57
Associate Analysis

Evolution: add the personal attributes age, sex, height, body type, eye color, hair length, education, occupation, smoking, drinking, religion, marital status, want children, first language and second language
35. Smoking=non Partner_Creative_to_Dull=2 40 ==> Second_Language=english 37 conf:(0.93)
4428. Smoking=non Religion=none First_Language=cantonese Second_Language=english Partner_Pessimistic_to_Enthusiastic=4 Partner_Conservative_to_Eccentric=3 13 ==> Married_Status=single 13 conf:(1)

Associations may not align with Domain Knowledge
(We are still working in this area)
Pattern discovered is dependent on the
attributes you select (Domain Knowledge)

Issues
58
Cluster Analysis
Analysis

See whether there is any interesting pattern between horoscope and double dating. The clustering result showed no specific pattern after 3 iterations, and cluster sizes ranged from 2 to 10.

The only star sign to be clustered was Virgo

Observation: Is Virgo the most faithful?

59
Cluster Analysis
Analysis

Using a classification algorithm, a higher relevance was found between Sensitive_to_Indifferent and Double_Date. Therefore, the next step is to use clustering to find whether there is any interesting pattern.

SimpleKMeans -N 3 -S 10
kMeans
Number of iterations: 3
Within cluster sum of squared errors: 32.71890394088671
Cluster centroids:
Cluster 0  Mean/Mode: male    2.3929  no   Std Devs: N/A  0.8751  N/A
Cluster 1  Mean/Mode: female  2.0345  no   Std Devs: N/A  0.7784  N/A
Cluster 2  Mean/Mode: male    2.8214  yes  Std Devs: N/A  0.7724  N/A
Clustered Instances
0  28 (33%)
1  29 (34%)
2  28 (33%)

Observation: Indifferent men double date

60
Cluster Analysis
Issues

Clustering for clarity?
There might be no meaningful cluster at all!
Is there any correct way to cluster?
There is no best way for clustering.
In this example, application of a different Data
Mining technique answered the original question
keep an open mind and try different things.
Pain areas
How to determine the number of clusters?
Attribute significance cannot be determined.
Lacks explanation capabilities - use domain
knowledge and imagination!

61
Classify (Case Study)
Analysis

62
Classify
Our aim is to predict an individual's ideal partner (an example)
Oops! As you can see, this is a work in progress
63
Key Learning Points
Issues

Domain knowledge is king! You cannot do DM without it.
What came first, the chicken or the egg? Do you set your objectives first, or do you use data mining to identify them?
Garbage in, garbage out: data capture and preparation is a critical part of the data mining process; to ignore it is to invite disaster. Discovering the appropriate attributes will make or break a data mining exercise.
Quality versus quantity: manage the trade-off between creating a data-rich source with less participation and a data-poor source with larger participation.
When do we need to retrain the model? What factors influence this decision?
Trust, but audit! Walk through and check consistency at every stage of data mining (ideally involving all members of the design team)

64
Key Learning Points
Issues

What is an interesting pattern? How you identify one can be subjective and depends on domain knowledge. We are still learning what to do with our data.
Never trust the data as entered: always check it and align it with the real world.
How do you manage a data mining project? That is the million-dollar question. Although we are still working on it, we are improving all the time.
Walk before you run: build up individual and team experience before tackling difficult data mining tasks. Our team was completely confused after the first week; it can be argued we still are!
Finally, data mining is not easy! It is an art, not a science, and therefore alien to engineers. But it is fun and exciting when you discover something new!

65
Current Research at HKUST: http://www.cs.ust.hk/qyang

Wireless Data Mining
How to detect user behaviors in wireless
environment?
We are the inventor of wireless user-activity
recognition software
Web Data Mining
How to classify text documents?
We are the winner of 2005 ACM KDDCUP world web
search competition

66
Conclusions

Data Mining is a process, not a tool
Data mining can be used in financial industry to
Analyze customer profiles
Rank customers on credit and risk
Structure marketing campaigns
Plans for action
