Lecture 3: Descriptive Data Mining, Peter van der Putten (putten_at_liacs.nl)

1
Lecture 3: Descriptive Data Mining
Peter van der Putten (putten_at_liacs.nl)
2
Course Outline

Objective
Understand the basics of data mining
Gain understanding of the potential for applying
it in the bioinformatics domain
Hands on experience
Schedule
Evaluation
Practical assignment (2nd) plus take home
exercise
Website
http://www.liacs.nl/putten/edu/dbdm05/

3
Agenda Today: Descriptive Data Mining

Before Starting to Mine.
Descriptive Data Mining
Dimension Reduction Projection
Clustering
Hierarchical clustering
K-means
Self organizing maps
Association rules
Frequent item sets
Association Rules
APRIORI
Bio-informatics case: FSG for frequent subgraph discovery

4
Before starting to mine.

Pima Indians Diabetes Data
X: body mass index
Y: age

5
Before starting to mine.
6
Before starting to mine.
7
Before starting to mine.

Attribute Selection
This example: InfoGain by attribute
Keep the most important ones
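
As a rough illustration of ranking attributes by information gain, here is a minimal sketch in Python. The attribute names (`attr_a`, `attr_b`) and the toy data are hypothetical and not taken from the slides; the sketch assumes nominal attributes and a nominal class.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def info_gain(values, labels):
    """Class entropy minus the weighted class entropy after splitting on one attribute."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Hypothetical toy data: two nominal attributes and a binary class.
data = {
    "attr_a": ["x", "x", "y", "y", "x", "y"],
    "attr_b": ["p", "q", "p", "q", "p", "q"],
}
labels = ["yes", "yes", "no", "no", "yes", "no"]

# Rank attributes from most to least informative; keep the top ones.
ranking = sorted(data, key=lambda name: info_gain(data[name], labels), reverse=True)
for name in ranking:
    print(name, round(info_gain(data[name], labels), 3))
```

This is a uni-variate filter: each attribute is scored on its own, so it says nothing about whether an attribute adds power to a set already in use.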

8
Before starting to mine.

Types of Attribute Selection
Uni-variate versus multivariate (subset selection)
The fact that attribute x is a strong uni-variate
predictor does not necessarily mean it will add
predictive power to a set of predictors already
used by a model
Filter versus wrapper
Wrapper methods involve the subsequent learner
(classifier or other)

9
Dimension Reduction

Projecting high dimensional data into a lower
dimension
Principal Component Analysis
Independent Component Analysis
Fisher Mapping, Sammon Mapping etc.
Multi Dimensional Scaling
See Pattern Recognition Course (Duin)

10
Data Mining Tasks: Clustering
Clustering is the discovery of groups in a set of instances.
Groups are different; instances in a group are similar.
In 2- to 3-dimensional pattern space you could just visualise the data and
leave the recognition to a human end user.
Example axes: weight and age
11
Data Mining Tasks: Clustering
Clustering is the discovery of groups in a set of instances.
Groups are different; instances in a group are similar.
In 2- to 3-dimensional pattern space you could just visualise the data and
leave the recognition to a human end user.
In >3 dimensions this is not possible.
Example axes: weight and age
12
Clustering Techniques

Hierarchical algorithms
Agglomerative
Divisive
Partition based clustering
K-Means
Self Organizing Maps / Kohonen Networks
Probabilistic Model based
Expectation Maximization / Mixture Models

13
Hierarchical clustering

Agglomerative / Bottom up
Start with single-instance clusters
At each step, join the two closest clusters
Method to compute distance between cluster x and y:
single linkage (distance between the closest points in clusters x and y),
average linkage (average distance between all pairs of points, one from each cluster),
complete linkage (distance between the furthest points),
centroid (distance between the cluster centroids)
Distance measure: Euclidean, correlation etc.
Divisive / Top Down
Start with all data in one cluster
Split into two clusters based on category utility
Proceed recursively on each subset
Both methods produce a dendrogram
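
The agglomerative procedure can be sketched in a few lines of Python. This is a naive illustration under assumptions, not the implementation used for the examples in this lecture: it uses Euclidean distance, supports single, complete and average linkage, and recomputes all cluster distances at every step. The function names and the sample points are hypothetical.

```python
import math
from itertools import combinations

def cluster_distance(a, b, linkage="single"):
    """Distance between clusters a and b (lists of points) under a linkage rule."""
    pair_dists = [math.dist(p, q) for p in a for q in b]
    if linkage == "single":
        return min(pair_dists)
    if linkage == "complete":
        return max(pair_dists)
    if linkage == "average":
        return sum(pair_dists) / len(pair_dists)
    raise ValueError(f"unknown linkage: {linkage}")

def agglomerative(points, linkage="single"):
    """Return the merge history as (cluster_a, cluster_b, distance) per step."""
    clusters = [[p] for p in points]  # start with single-instance clusters
    merges = []
    while len(clusters) > 1:
        # Find the two closest clusters under the chosen linkage.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]], linkage),
        )
        d = cluster_distance(clusters[i], clusters[j], linkage)
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

# Hypothetical 2-D points.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
history = agglomerative(points, linkage="single")
for a, b, d in history:
    print(a, "+", b, "at distance", round(d, 2))
```

The merge history, read from first to last merge, contains the information a dendrogram displays: which clusters are joined and at what distance.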

14
Levels of Clustering
Agglomerative
Divisive
Dunham, 2003
15
Hierarchical Clustering Example

Clustering Microarray Gene Expression Data
Gene expression measured using microarrays,
studied under a variety of conditions
On budding yeast Saccharomyces cerevisiae
Efficiently groups together genes of known
similar function

16
Hierarchical Clustering Example

Method
Genes are the instances, samples the attributes!
Agglomerative
Distance measure: correlation

17
Simple Clustering: K-means

Pick a number (k) of cluster centers (at random)
Cluster centers are sometimes called codes, and
the k codes a codebook
Assign every item to its nearest cluster center
For instance, using Euclidean distance
Move each cluster center to the mean of its
assigned items
Repeat until convergence
change in cluster assignments less than a
threshold
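
The steps above can be sketched as follows. This is a minimal illustration under assumptions: the initial codes are drawn at random from the data items (one common choice; the slides only say "at random"), distance is Euclidean, and the loop stops when no assignment changes, i.e. a threshold of zero. The function name `kmeans` and the sample points are hypothetical.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means on a list of equal-length tuples; returns (centers, assignment)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # assumption: initial codes are k distinct data items
    assignment = None
    for _ in range(max_iter):
        # Assign every item to its nearest cluster center.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        if new_assignment == assignment:  # converged: no item changed cluster
            break
        assignment = new_assignment
        # Move each center to the mean of its assigned items.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centers, assignment

# Hypothetical data: two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, assignment = kmeans(points, k=2)
print(centers)
print(assignment)
```

Because the result depends on the initial codes, in practice the procedure is often run several times with different seeds.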

KDnuggets
18
K-means example, step 1
Initially distribute codes randomly in
pattern space
KDnuggets
19
K-means example, step 2
Assign each point to the closest code
KDnuggets
20
K-means example, step 3
Move each code to the mean of all its assigned
points
KDnuggets
21
K-means example, step 2
Repeat the process: reassign the data points to
the codes. Q: Which points are reassigned?
KDnuggets
22
K-means example
Repeat the process: reassign the data points to
the codes. Q: Which points are reassigned?
KDnuggets
23
K-means example
re-compute cluster means
KDnuggets
24
K-means example
move cluster centers to cluster means
KDnuggets
25
K-means clustering summary

Advantages
Simple, understandable
Items automatically assigned to clusters

Disadvantages
Must pick number of clusters beforehand
All items forced into a cluster
Sensitive to outliers

Extensions
Adaptive k-means
K-medoids (uses a central data item, in one dimension the median, instead of the mean)
1, 2, 3, 4, 100 → average 22, median 3
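
The outlier example can be checked directly. The snippet below is a small sketch using Python's standard `statistics` module; the medoid line is an added illustration that assumes absolute difference as the distance.

```python
from statistics import mean, median

values = [1, 2, 3, 4, 100]

# The mean is pulled towards the outlier 100; the median is not.
print(mean(values))    # 22
print(median(values))  # 3

# A medoid is the data item with the smallest total distance to all other items.
medoid = min(values, key=lambda c: sum(abs(c - v) for v in values))
print(medoid)          # 3
```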

26
Biological Example

Clustering of yeast cell images
Two clusters are found
Left cluster primarily cells with thick capsule,
right cluster thin capsule
caused by media, proxy for sick vs healthy

27
Self Organizing Maps (Kohonen Maps)

Claim to fame
Simplified models of cortical maps in the brain
Things that are near in the outside world link to
areas near in the cortex
For a variety of modalities: touch, motor, ... up
to echolocation
Nice visualization

From a data mining perspective
SOMs are simple extensions of k-means clustering
Codes are connected in a lattice
In each iteration, the codes neighboring the winning code
in the lattice are also allowed to move
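
To make the relation to k-means concrete, here is a minimal SOM sketch in Python. It is an illustration under several assumptions that the slides do not specify: a 1-D lattice (a chain of codes) rather than a 2-D grid, a linearly decaying learning rate, a fixed neighborhood radius, and neighbors moving half as far per lattice hop. All names and parameter values are hypothetical.

```python
import math
import random

def train_som(points, n_codes=5, epochs=50, lr=0.5, radius=1, seed=0):
    """Train a 1-D SOM: code i has lattice neighbors i-1 and i+1."""
    rng = random.Random(seed)
    dim = len(points[0])
    codes = [[rng.random() for _ in range(dim)] for _ in range(n_codes)]
    for epoch in range(epochs):
        rate = lr * (1 - epoch / epochs)  # learning rate decays towards zero
        for p in rng.sample(points, len(points)):  # visit items in random order
            # The winning code is the one nearest to the item, as in k-means.
            winner = min(range(n_codes), key=lambda i: math.dist(p, codes[i]))
            lo, hi = max(0, winner - radius), min(n_codes, winner + radius + 1)
            for i in range(lo, hi):
                # Lattice neighbors of the winner also move, but by a smaller step.
                step = rate * 0.5 ** abs(i - winner)
                codes[i] = [c + step * (x - c) for c, x in zip(codes[i], p)]
    return codes

# Hypothetical data: 200 points drawn uniformly from the unit square.
data_rng = random.Random(1)
points = [(data_rng.random(), data_rng.random()) for _ in range(200)]
codes = train_som(points)
for code in codes:
    print([round(c, 2) for c in code])
```

With `radius=0` only the winner moves and the update reduces to an online variant of k-means.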

28
SOM
10x10 SOM, Gaussian distribution
29
SOM
30
SOM
31
SOM
32
SOM example
33
Famous example: Phonetic Typewriter

SOM lattice below left is trained on spoken
letters, after convergence codes are labeled
Creates a phonotopic map
Spoken word creates a sequence of labels

34
Famous example: Phonetic Typewriter

Criticism
Topology preserving property is not used so why
use SOMs and not adaptive k-means for instance?
K-means could also create a sequence
This is true for most SOM applications!
Is using clustering for classification optimal?

35
Bioinformatics Example: Clustering GPCRs

Clustering G Protein Coupled Receptors (GPCRs)
Samsonova et al., 2003, 2004
Important drug target, function often unknown

36
Bioinformatics Example: Clustering GPCRs
37
Association Rules Outline

What are frequent item sets and association rules?
Quality measures
support, confidence, lift
How to find item sets efficiently?
APRIORI
How to generate association rules from an item
set?
Biological examples

KDnuggets
38
Market Basket Example / Gene Expression Example

Frequent item set
{MILK, BREAD} = 4
Association rule
{MILK, BREAD} => {EGGS}
Frequency / importance = 2 ('Support')
Quality = 50% ('Confidence')

What genes are expressed (active) together?
Interaction / regulation
Similar function

39
Association Rule Definitions

Set of items: $I = \{I_1, I_2, \ldots, I_m\}$
Transactions: $D = \{t_1, t_2, \ldots, t_n\}$, $t_j \subseteq I$
Itemset: $\{I_{i_1}, I_{i_2}, \ldots, I_{i_k}\} \subseteq I$
Support of an itemset: percentage of transactions
which contain that itemset.
Large (frequent) itemset: itemset whose number of
occurrences is above a threshold.

Dunham, 2003
40
Frequent Item Set Example
I = {Beer, Bread, Jelly, Milk, PeanutButter}
Support of {Bread, PeanutButter} is 60%
Dunham, 2003
41
Association Rule Definitions

Association Rule (AR): implication $X \Rightarrow Y$ where
$X, Y \subseteq I$ and $X, Y$ disjoint ($X \cap Y = \emptyset$)
Support of AR ($s$) $X \Rightarrow Y$: percentage of
transactions that contain $X \cup Y$
Confidence of AR ($\alpha$) $X \Rightarrow Y$: ratio of the number of
transactions that contain $X \cup Y$ to the number
that contain $X$
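
The definitions of support and confidence translate almost literally into code. The sketch below is illustrative only: the helper names and the four toy transactions are hypothetical and are not the transaction table from the slides.

```python
def support_count(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """Fraction of transactions that contain the itemset."""
    return support_count(itemset, transactions) / len(transactions)

def confidence(x, y, transactions):
    """Confidence of the rule X => Y: count(X union Y) / count(X)."""
    return support_count(x | y, transactions) / support_count(x, transactions)

# Hypothetical toy database of four transactions.
transactions = [
    frozenset({"Bread", "Milk"}),
    frozenset({"Bread", "Milk", "Eggs"}),
    frozenset({"Milk", "Eggs"}),
    frozenset({"Bread"}),
]

print(support(frozenset({"Bread", "Milk"}), transactions))
print(confidence(frozenset({"Bread", "Milk"}), frozenset({"Eggs"}), transactions))
```

Here {Bread, Milk} occurs in 2 of 4 transactions (support 50%), and 1 of those 2 also contains Eggs (confidence 50%).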

Dunham, 2003
42
Association Rules Ex (contd)
Dunham, 2003
43
Association Rule Problem

Given a set of items $I = \{I_1, I_2, \ldots, I_m\}$ and a
database of transactions $D = \{t_1, t_2, \ldots, t_n\}$ where
$t_i = \{I_{i_1}, I_{i_2}, \ldots, I_{i_k}\}$ and $I_{i_j} \in I$, the Association
Rule Problem is to identify all association rules
$X \Rightarrow Y$ with a minimum support and confidence.
NOTE: Support of $X \Rightarrow Y$ is the same as support of $X \cup Y$.

Dunham, 2003
44
Association Rules Example

Q: Given frequent set {A,B,E}, what association
rules have minsup = 2 and minconf = 50%?
A, B => E : conf = 2/4 = 50%
A, E => B : conf = 2/2 = 100%
B, E => A : conf = 2/2 = 100%
E => A, B : conf = 2/2 = 100%
Don't qualify:
A => B, E : conf = 2/6 = 33% < 50%
B => A, E : conf = 2/7 = 29% < 50%
__ => A,B,E : conf = 2/9 = 22% < 50%
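
The same selection can be reproduced mechanically. The sketch below takes the support counts implied by the fractions in this example (the denominators are the supports of the left-hand sides, and the empty set is contained in all 9 transactions) and keeps every rule whose confidence reaches minconf. The function name is hypothetical.

```python
from itertools import combinations

# Support counts read off the fractions in the example above.
counts = {
    frozenset(): 9,
    frozenset("A"): 6, frozenset("B"): 7, frozenset("E"): 2,
    frozenset("AB"): 4, frozenset("AE"): 2, frozenset("BE"): 2,
    frozenset("ABE"): 2,
}

def rules_from_itemset(itemset, counts, minconf):
    """Enumerate rules lhs => (itemset - lhs) and keep those with conf >= minconf."""
    itemset = frozenset(itemset)
    accepted = []
    for size in range(len(itemset)):  # left-hand sides of size 0 .. n-1
        for lhs in combinations(sorted(itemset), size):
            lhs = frozenset(lhs)
            conf = counts[itemset] / counts[lhs]
            if conf >= minconf:
                accepted.append((sorted(lhs), sorted(itemset - lhs), conf))
    return accepted

rules = rules_from_itemset("ABE", counts, minconf=0.5)
for lhs, rhs, conf in rules:
    print(lhs, "=>", rhs, f"conf = {conf:.0%}")
```

Four of the seven possible rules pass the 50% threshold, matching the list above.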

KDnuggets
45
Solution Association Rule Problem

First, find all frequent itemsets with sup >= minsup
Exhaustive search won't work
Assume we have a set of m items → 2^m subsets!
Exploit the subset property (APRIORI algorithm)
For every frequent item set, derive rules with
confidence >= minconf

KDnuggets
46
Finding itemsets next level

Apriori algorithm (Agrawal & Srikant)
Idea: use one-item sets to generate two-item
sets, two-item sets to generate three-item sets,
...
Subset Property: If (A B) is a frequent item set,
then (A) and (B) have to be frequent item sets as
well!
In general: if X is a frequent k-item set, then all
(k-1)-item subsets of X are also frequent
Compute k-item sets by merging (k-1)-item sets
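
A compact level-wise sketch of this idea in Python follows. It is a simplified illustration, not the optimized algorithm from the original paper: candidates are formed by taking unions of frequent (k-1)-item sets, pruned with the subset property, and then counted with a full pass over the data. The function name and the nine toy transactions are hypothetical.

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Return {itemset: support count} for all itemsets with count >= minsup."""
    transactions = [frozenset(t) for t in transactions]
    level = {frozenset([item]) for t in transactions for item in t}  # one-item sets
    frequent = {}
    k = 1
    while level:
        # Count the candidates of size k and keep the frequent ones.
        counts = {c: sum(1 for t in transactions if c <= t) for c in level}
        current = {c: n for c, n in counts.items() if n >= minsup}
        frequent.update(current)
        k += 1
        # Join: merge frequent (k-1)-item sets into k-item candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune: every (k-1)-item subset of a candidate must itself be frequent.
        level = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
    return frequent

# Hypothetical toy database of nine transactions over items A..E.
transactions = ["ABE", "BD", "BC", "ABD", "AC", "BC", "AC", "ABCE", "ABC"]
frequent = apriori(transactions, minsup=2)
for itemset, n in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print("".join(sorted(itemset)), n)
```

The pruning step means that most infrequent candidates are discarded before any counting pass over the transactions is spent on them.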

KDnuggets
47
An example

Given: five three-item sets
(A B C), (A B D), (A C D), (A C E), (B C D)
Candidate four-item sets:
(A B C D) Q: OK?
A: Yes, because all 3-item subsets are frequent
(A C D E) Q: OK?
A: No, because (C D E) is not frequent
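
The two answers can be verified with a short check of the subset property. The helper names below are hypothetical; the five three-item sets are the ones given in this example.

```python
from itertools import combinations

# The five frequent three-item sets from the example.
frequent_3 = {frozenset(s) for s in ("ABC", "ABD", "ACD", "ACE", "BCD")}

def missing_subsets(candidate, frequent_smaller):
    """(k-1)-item subsets of the candidate that are not frequent."""
    k = len(candidate)
    return [
        "".join(s) for s in combinations(sorted(candidate), k - 1)
        if frozenset(s) not in frequent_smaller
    ]

def survives_pruning(candidate, frequent_smaller):
    """A candidate is kept only if none of its (k-1)-item subsets is missing."""
    return not missing_subsets(candidate, frequent_smaller)

for candidate in ("ABCD", "ACDE"):
    print(candidate, survives_pruning(candidate, frequent_3),
          missing_subsets(candidate, frequent_3))
```

For (A C D E) the check reports both (A D E) and (C D E) as missing; one missing subset is already enough to discard the candidate.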

KDnuggets
48
From Frequent Itemsets to Association Rules

Q: Given frequent set {A,B,E}, what are possible
association rules?
A => B, E
A, B => E
A, E => B
B => A, E
B, E => A
E => A, B
__ => A,B,E (empty rule), or true => A,B,E

KDnuggets
49
Example: Generating Rules from an Itemset

Frequent itemset from golf data:
Humidity = Normal, Windy = False, Play = Yes (4)

Seven potential rules:
If Humidity = Normal and Windy = False then Play = Yes               4/4
If Humidity = Normal and Play = Yes then Windy = False               4/6
If Windy = False and Play = Yes then Humidity = Normal               4/6
If Humidity = Normal then Windy = False and Play = Yes               4/7
If Windy = False then Humidity = Normal and Play = Yes               4/8
If Play = Yes then Humidity = Normal and Windy = False               4/9
If True then Humidity = Normal and Windy = False and Play = Yes      4/14
KDnuggets
50
Example: Generating Rules

Rules with support > 1 and confidence = 100%
In total: 3 rules with support four, 5 with
support three, and 50 with support two

     Association rule                                   Sup.  Conf.
1    Humidity=Normal Windy=False => Play=Yes            4     100%
2    Temperature=Cool => Humidity=Normal                4     100%
3    Outlook=Overcast => Play=Yes                       4     100%
4    Temperature=Cool Play=Yes => Humidity=Normal       3     100%
...  ...                                                ...   ...
58   Outlook=Sunny Temperature=Hot => Humidity=High     2     100%
KDnuggets
51
Weka associations output
KDnuggets
52
Extensions and Challenges

Extra quality measure: Lift
The lift of an association rule $I \Rightarrow J$ is defined
as
$\text{lift} = P(J \mid I) / P(J)$
Note: $P(I) = (\text{support of } I) / (\text{no. of transactions})$
Lift is the ratio of confidence to expected confidence
Interpretation:
if lift > 1, then I and J are positively
correlated
if lift < 1, then I and J are negatively
correlated
if lift = 1, then I and J are
independent
Other measures for interestingness
$A \Rightarrow B$, $B \Rightarrow C$, but not $A \Rightarrow C$
Efficient algorithms
Known Problem
What to do with all these rules? How to exploit /
make useful / actionable?
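
As a small worked example of the lift measure, the sketch below computes lift from raw support counts. The function name and the counts in the two calls are illustrative assumptions.

```python
def lift(count_ij, count_i, count_j, n_transactions):
    """Lift of the rule I => J from support counts.

    count_ij: transactions containing both I and J
    count_i:  transactions containing I
    count_j:  transactions containing J
    """
    confidence = count_ij / count_i       # P(J | I)
    expected = count_j / n_transactions   # P(J), the expected confidence
    return confidence / expected

# Illustrative counts: I in 4 of 9 transactions, J in 2 of 9, both in 2 of 9.
print(lift(2, 4, 2, 9))

# Illustrative counts for independence: P(J | I) equals P(J).
print(lift(10, 20, 50, 100))
```

In the first call the confidence is 50% against an expected confidence of about 22%, giving a lift of 2.25 (positively correlated); the second call gives a lift of 1.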

KDnuggets
53
Biomedical Application: Head and Neck Cancer Example

 1. ace27=0 fiveyralive 381 => tumorbefore=0 372 conf(0.98)
 2. gender=M ace27=0 467 => tumorbefore=0 455 conf(0.97)
 3. ace27=0 588 => tumorbefore=0 572 conf(0.97)
 4. tnm=T0N0M0 ace27=0 405 => tumorbefore=0 391 conf(0.97)
 5. loc=LOC7 tumorbefore=0 409 => tnm=T0N0M0 391 conf(0.96)
 6. loc=LOC7 442 => tnm=T0N0M0 422 conf(0.95)
 7. loc=LOC7 gender=M tumorbefore=0 374 => tnm=T0N0M0 357 conf(0.95)
 8. loc=LOC7 gender=M 406 => tnm=T0N0M0 387 conf(0.95)
 9. gender=M fiveyralive 633 => tumorbefore=0 595 conf(0.94)
10. fiveyralive 778 => tumorbefore=0 726 conf(0.93)

54
Bioinformatics Application

The idea of association rules has been
customized for bioinformatics applications
In biology it is often interesting to find
frequent structures rather than items
For instance protein or other chemical structures
Solution: Mining Frequent Patterns
FSG (Kuramochi and Karypis, ICDM 2001)
gSpan (Yan and Han, ICDM 2002)
CloseGraph (Yan and Han, KDD 2002)

55
FSG: Mining Frequent Patterns
56
FSG: Mining Frequent Patterns
57
FSG Algorithm for finding frequent subgraphs
58
Frequent Subgraph Examples: AIDS Data

Compounds are active, inactive or moderately
active (CA, CI, CM)

59
Predictive Subgraphs

The three most discriminating sub-structures
for the PTC, AIDS, and Anthrax datasets

60
Experiments and Results: AIDS Data
61
FSG References

Frequent Sub-structure Based Approaches for Classifying Chemical Compounds. Mukund Deshpande, Michihiro Kuramochi, and George Karypis. ICDM 2003
An Efficient Algorithm for Discovering Frequent Subgraphs. Michihiro Kuramochi and George Karypis. IEEE TKDE
Automated Approaches for Classifying Structures. Mukund Deshpande, Michihiro Kuramochi, and George Karypis. BIOKDD 2002
Discovering Frequent Geometric Subgraphs. Michihiro Kuramochi and George Karypis. ICDM 2002
Frequent Subgraph Discovery. Michihiro Kuramochi and George Karypis. 1st IEEE Conference on Data Mining, 2001

62
Recap