Title: Lecture 3: Descriptive Data Mining Peter van der Putten (putten_at_liacs.nl)
1Lecture 3: Descriptive Data Mining
Peter van der Putten (putten_at_liacs.nl)
2Course Outline
- Objective
- Understand the basics of data mining
- Gain understanding of the potential for applying it in the bioinformatics domain
- Hands on experience
- Schedule
- Evaluation
- Practical assignment (2nd) plus take home exercise
- Website
- http://www.liacs.nl/putten/edu/dbdm05/
3Agenda Today: Descriptive Data Mining
- Before Starting to Mine.
- Descriptive Data Mining
- Dimension Reduction / Projection
- Clustering
- Hierarchical clustering
- K-means
- Self organizing maps
- Association rules
- Frequent item sets
- Association Rules
- APRIORI
- Bio-informatics case: FSG for frequent subgraph discovery
4Before starting to mine.
- Pima Indians Diabetes Data
- X: body mass index
- Y: age
5Before starting to mine.
6Before starting to mine.
7Before starting to mine.
- Attribute Selection
- This example: InfoGain by attribute
- Keep the most important ones
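Ranking attributes by information gain can be sketched as follows. This is a minimal illustration, not the tool used for the slide: the attribute names `bmi_band` and `age_band`, the toy labels, and the helper names `entropy` and `info_gain` are all assumptions made up for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def info_gain(values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on a nominal attribute."""
    total = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [lab for val, lab in zip(values, labels) if val == v]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

# Hypothetical toy data: two nominal attributes and a class label.
labels = ["pos", "pos", "neg", "neg"]
attributes = {
    "bmi_band": ["high", "high", "low", "low"],    # separates the classes perfectly
    "age_band": ["old", "young", "old", "young"],  # carries no class information
}

# Rank attributes from most to least informative, then keep the top ones.
ranking = sorted(attributes, key=lambda a: info_gain(attributes[a], labels), reverse=True)
print(ranking)
```

This is a uni-variate filter: each attribute is scored on its own, without looking at the other attributes or at the learner that will use them.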
8Before starting to mine.
- Types of Attribute Selection
- Uni-variate versus multivariate (subset selection)
- The fact that attribute x is a strong uni-variate predictor does not necessarily mean it will add predictive power to a set of predictors already used by a model
- Filter versus wrapper
- Wrapper methods involve the subsequent learner (classifier or other)
9Dimension Reduction
- Projecting high dimensional data into a lower dimension
- Principal Component Analysis
- Independent Component Analysis
- Fisher Mapping, Sammon Mapping etc.
- Multi Dimensional Scaling
- See Pattern Recognition Course (Duin)
10Data Mining Tasks: Clustering
Clustering is the discovery of groups in a set of instances. Groups are different, instances in a group are similar. In 2 to 3 dimensional pattern space you could just visualise the data and leave the recognition to a human end user.
(Figure: scatter plot; axes e.g. weight, e.g. age)
11Data Mining Tasks: Clustering
Clustering is the discovery of groups in a set of instances. Groups are different, instances in a group are similar. In 2 to 3 dimensional pattern space you could just visualise the data and leave the recognition to a human end user. In >3 dimensions this is not possible.
(Figure: scatter plot; axes e.g. weight, e.g. age)
12Clustering Techniques
- Hierarchical algorithms
- Agglomerative
- Divisive
- Partition based clustering
- K-Means
- Self Organizing Maps / Kohonen Networks
- Probabilistic Model based
- Expectation Maximization / Mixture Models
13Hierarchical clustering
- Agglomerative / Bottom up
- Start with single-instance clusters
- At each step, join the two closest clusters
- Method to compute distance between cluster x and y: single linkage (distance between the closest points in cluster x and y), average linkage (average distance between all points), complete linkage (distance between furthest points), centroid
- Distance measure: Euclidean, correlation etc.
- Divisive / Top Down
- Start with all data in one cluster
- Split into two clusters based on category utility
- Proceed recursively on each subset
- Both methods produce a dendrogram
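The agglomerative procedure and the linkage options can be sketched as follows. This is a deliberately naive illustration assuming Euclidean distance and small data; the function names `cluster_distance` and `agglomerate` and the sample points are invented for the example, and centroid linkage is left out.

```python
import math
from itertools import combinations

def cluster_distance(a, b, linkage="single"):
    """Distance between two clusters (lists of points) under a given linkage."""
    pairwise = [math.dist(p, q) for p in a for q in b]
    if linkage == "single":
        return min(pairwise)                  # closest pair of points
    if linkage == "complete":
        return max(pairwise)                  # furthest pair of points
    if linkage == "average":
        return sum(pairwise) / len(pairwise)  # mean over all pairs
    raise ValueError(f"unknown linkage: {linkage}")

def agglomerate(points, linkage="single"):
    """Bottom-up clustering; returns the merge history (one entry per dendrogram node)."""
    clusters = [[p] for p in points]          # start with single-instance clusters
    merges = []
    while len(clusters) > 1:
        # Find the two closest clusters and join them.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]], linkage))
        merges.append((clusters[i], clusters[j],
                       cluster_distance(clusters[i], clusters[j], linkage)))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
for left, right, d in agglomerate(points, "single"):
    print(left, "+", right, f"at distance {d:.2f}")
```

The merge history, read from first to last entry, is the dendrogram from the leaves up to the root.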
14Levels of Clustering
Agglomerative
Divisive
Dunham, 2003
15Hierarchical Clustering Example
- Clustering Microarray Gene Expression Data
- Gene expression measured using microarrays, studied under a variety of conditions
- On budding yeast Saccharomyces cerevisiae
- Efficiently groups together genes of known similar function
16Hierarchical Clustering Example
- Method
- Genes are the instances, samples the attributes!
- Agglomerative
- Distance measure: correlation
17Simple Clustering: K-means
- Pick a number (k) of cluster centers (at random)
- Cluster centers are sometimes called codes, and the k codes a codebook
- Assign every item to its nearest cluster center
- E.g. Euclidean distance
- Move each cluster center to the mean of its assigned items
- Repeat until convergence
- change in cluster assignments less than a threshold
KDnuggets
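The k-means loop can be sketched in a few lines. This is a minimal version under stated assumptions: points are tuples of floats, the initial codes are drawn from the data, and convergence means that no assignment changed at all (a threshold of zero). The function name `kmeans` and the sample points are made up for the illustration.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means on tuples of floats; returns (codes, assignment)."""
    rng = random.Random(seed)
    codes = rng.sample(points, k)          # pick k initial codes at random
    assignment = None
    for _ in range(max_iter):
        # Assign every item to its nearest code (Euclidean distance).
        new_assignment = [min(range(k), key=lambda c: math.dist(p, codes[c]))
                          for p in points]
        if new_assignment == assignment:   # converged: no assignment changed
            break
        assignment = new_assignment
        # Move each code to the mean of its assigned items.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                    # an empty cluster keeps its old code
                codes[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return codes, assignment

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
codes, assignment = kmeans(points, k=2)
print(codes)
print(assignment)
```

Which cluster gets index 0 depends on the random initialisation, so only the grouping is meaningful, not the labels.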
18K-means example, step 1
Initially distribute codes randomly in
pattern space
KDnuggets
19K-means example, step 2
Assign each point to the closest code
KDnuggets
20K-means example, step 3
Move each code to the mean of all its assigned
points
KDnuggets
21K-means example, step 2
Repeat the process: reassign the data points to the codes.
Q: Which points are reassigned?
KDnuggets
22K-means example
Repeat the process: reassign the data points to the codes.
Q: Which points are reassigned?
KDnuggets
23K-means example
re-compute cluster means
KDnuggets
24K-means example
move cluster centers to cluster means
KDnuggets
25K-means clustering summary
- Advantages
- Simple, understandable
- items automatically assigned to clusters
- Disadvantages
- Must pick number of clusters beforehand
- All items forced into a cluster
- Sensitive to outliers
- Extensions
- Adaptive k-means
- K-medoids (center is a representative item, comparable to using the median instead of the mean)
- 1, 2, 3, 4, 100 -> average 22, median 3
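The outlier arithmetic can be checked directly. The snippet below is only an illustration; the `medoid` line assumes absolute difference as the distance and picks the item with the smallest total distance to all other items.

```python
import statistics

items = [1, 2, 3, 4, 100]            # 100 is the outlier

print(statistics.mean(items))        # 22: pulled towards the outlier
print(statistics.median(items))      # 3: unaffected by the outlier

# Medoid: the actual item with the smallest summed distance to all items.
medoid = min(items, key=lambda m: sum(abs(m - x) for x in items))
print(medoid)                        # 3: coincides with the median here
```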
26Biological Example
- Clustering of yeast cell images
- Two clusters are found
- Left cluster primarily cells with thick capsule, right cluster thin capsule
- caused by media, proxy for sick vs healthy
27Self Organizing Maps (Kohonen Maps)
- Claim to fame
- Simplified models of cortical maps in the brain
- Things that are near in the outside world link to areas near in the cortex
- For a variety of modalities: touch, motor, ... up to echolocation
- Nice visualization
- From a data mining perspective
- SOMs are simple extensions of k-means clustering
- Codes are connected in a lattice
- In each iteration, codes neighboring the winning code in the lattice are also allowed to move
28SOM
10x10 SOM Gaussian Distribution
29SOM
30SOM
31SOM
32SOM example
33Famous example: Phonetic Typewriter
- SOM lattice below left is trained on spoken letters, after convergence codes are labeled
- Creates a phonotopic map
- Spoken word creates a sequence of labels
34Famous example: Phonetic Typewriter
- Criticism
- Topology preserving property is not used, so why use SOMs and not adaptive k-means for instance?
- K-means could also create a sequence
- This is true for most SOM applications!
- Is using clustering for classification optimal?
35Bioinformatics Example: Clustering GPCRs
- Clustering G Protein Coupled Receptors (GPCRs) [Samsanova et al, 2003, 2004]
- Important drug target, function often unknown
36Bioinformatics Example: Clustering GPCRs
37Association Rules Outline
- What are frequent item sets and association rules?
- Quality measures
- support, confidence, lift
- How to find item sets efficiently?
- APRIORI
- How to generate association rules from an item set?
- Biological examples
KDnuggets
38Market Basket Example / Gene Expression Example
- Frequent item set
- {MILK, BREAD} = 4
- Association rule
- {MILK, BREAD} => {EGGS}
- Frequency / importance = 2 (Support)
- Quality = 50% (Confidence)
- What genes are expressed (active) together?
- Interaction / regulation
- Similar function
39Association Rule Definitions
- Set of items: $I = \{I_1, I_2, \dots, I_m\}$
- Transactions: $D = \{t_1, t_2, \dots, t_n\}$, $t_j \subseteq I$
- Itemset: $\{I_{i1}, I_{i2}, \dots, I_{ik}\} \subseteq I$
- Support of an itemset: percentage of transactions which contain that itemset.
- Large (frequent) itemset: itemset whose number of occurrences is above a threshold.
Dunham, 2003
40Frequent Item Set Example
$I = \{$Beer, Bread, Jelly, Milk, PeanutButter$\}$
Support of {Bread, PeanutButter} is 60%
Dunham, 2003
41Association Rule Definitions
- Association Rule (AR): implication $X \Rightarrow Y$ where $X, Y \subseteq I$ and $X, Y$ disjoint ($X \cap Y = \emptyset$)
- Support of AR (s) $X \Rightarrow Y$: percentage of transactions that contain $X \cup Y$
- Confidence of AR ($\alpha$) $X \Rightarrow Y$: ratio of the number of transactions that contain $X \cup Y$ to the number that contain $X$
Dunham, 2003
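The support and confidence definitions translate almost literally into code. The sketch below is an illustration only: the five transactions are an assumption (the transaction table itself is not part of this text) chosen so that {Bread, PeanutButter} has 60% support, and the helper names `count`, `support` and `confidence` are invented.

```python
def count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    return sum(set(itemset) <= t for t in transactions)

def support(itemset, transactions):
    """Fraction of transactions that contain the itemset."""
    return count(itemset, transactions) / len(transactions)

def confidence(x, y, transactions):
    """For the rule X => Y: transactions containing X and Y, divided by those containing X."""
    return count(set(x) | set(y), transactions) / count(x, transactions)

# Assumed example transactions over I = {Beer, Bread, Jelly, Milk, PeanutButter}.
transactions = [
    {"Bread", "Jelly", "PeanutButter"},
    {"Bread", "PeanutButter"},
    {"Bread", "Milk", "PeanutButter"},
    {"Beer", "Bread"},
    {"Beer", "Milk"},
]

print(support({"Bread", "PeanutButter"}, transactions))        # 0.6
print(confidence({"Bread"}, {"PeanutButter"}, transactions))   # 0.75
```

Because X and Y are disjoint, "contains X and Y" is the same as "contains the union of X and Y", which is why the rule's support equals the support of the combined itemset.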
42Association Rules Ex (contd)
Dunham, 2003
43Association Rule Problem
- Given a set of items $I = \{I_1, I_2, \dots, I_m\}$ and a database of transactions $D = \{t_1, t_2, \dots, t_n\}$ where $t_i = \{I_{i1}, I_{i2}, \dots, I_{ik}\}$ and $I_{ij} \in I$, the Association Rule Problem is to identify all association rules $X \Rightarrow Y$ with a minimum support and confidence.
- NOTE: Support of $X \Rightarrow Y$ is the same as support of $X \cup Y$.
Dunham, 2003
44Association Rules Example
- Q: Given frequent set {A,B,E}, what association rules have minsup = 2 and minconf = 50%?
- A, B => E: conf = 2/4 = 50%
- A, E => B: conf = 2/2 = 100%
- B, E => A: conf = 2/2 = 100%
- E => A, B: conf = 2/2 = 100%
- Don't qualify
- A => B, E: conf = 2/6 = 33% < 50%
- B => A, E: conf = 2/7 = 29% < 50%
- __ => A,B,E: conf = 2/9 = 22% < 50%
KDnuggets
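The confidences above can be reproduced by enumerating every split of the frequent set into antecedent and consequent. The nine transactions below are an assumption: the slide's data table is not part of this text, so they were chosen to match the counts used in the fractions (A: 6, B: 7, E: 2, AB: 4, ABE: 2, nine transactions in total). The names `count` and `rules_from` are invented for the sketch.

```python
from itertools import combinations

# Assumed transactions, consistent with the counts in the example.
transactions = [
    {"A", "B", "E"}, {"B", "D"}, {"B", "C"}, {"A", "B", "D"}, {"A", "C"},
    {"B", "C"}, {"A", "C"}, {"A", "B", "C", "E"}, {"A", "B", "C"},
]

def count(itemset):
    """Number of transactions containing every item of `itemset` (the empty set matches all)."""
    return sum(itemset <= t for t in transactions)

def rules_from(itemset, minconf):
    """Yield (antecedent, consequent, confidence) for every split that meets minconf."""
    items = sorted(itemset)
    for size in range(len(items)):                 # size 0 is the empty antecedent
        for antecedent in combinations(items, size):
            x = frozenset(antecedent)
            conf = count(itemset) / count(x)
            if conf >= minconf:
                yield sorted(x), sorted(itemset - x), conf

for x, y, conf in rules_from(frozenset("ABE"), minconf=0.5):
    print(x, "=>", y, f"conf = {conf:.0%}")
```

Only the antecedent count changes from rule to rule; the numerator is always the count of the full itemset, so small antecedents that occur often give low confidence.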
45Solution Association Rule Problem
- First, find all frequent itemsets with sup >= minsup
- Exhaustive search won't work
- Assume we have a set of m items -> $2^m$ subsets!
- Exploit the subset property (APRIORI algorithm)
- For every frequent item set, derive rules with confidence >= minconf
KDnuggets
46Finding itemsets: next level
- Apriori algorithm (Agrawal & Srikant)
- Idea: use one-item sets to generate two-item sets, two-item sets to generate three-item sets, ...
- Subset Property: If (A B) is a frequent item set, then (A) and (B) have to be frequent item sets as well!
- In general: if X is a frequent k-item set, then all (k-1)-item subsets of X are also frequent
- Compute k-item set by merging (k-1)-item sets
KDnuggets
47An example
- Given: five three-item sets
- (A B C), (A B D), (A C D), (A C E), (B C D)
- Candidate four-item sets:
- (A B C D) Q: OK?
- A: Yes, because all 3-item subsets are frequent
- (A C D E) Q: OK?
- A: No, because (C D E) is not frequent
KDnuggets
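The merge-then-check step of this example can be sketched as a join followed by a prune. This is an illustrative version, not code from the lecture; the function name `generate_candidates` is an assumption, and item sets are represented as sorted tuples.

```python
from itertools import combinations

def generate_candidates(frequent, k):
    """Join frequent (k-1)-item sets that share their first k-2 items, then prune."""
    frequent = sorted(tuple(sorted(s)) for s in frequent)
    known = set(frequent)
    candidates = []
    for a, b in combinations(frequent, 2):
        if a[:-1] == b[:-1]:                       # join step: same prefix, different last item
            candidate = a + (b[-1],)
            # Prune step: every (k-1)-item subset must itself be frequent.
            if all(sub in known for sub in combinations(candidate, k - 1)):
                candidates.append(candidate)
    return candidates

three_item_sets = ["ABC", "ABD", "ACD", "ACE", "BCD"]
print(generate_candidates(three_item_sets, k=4))   # [('A', 'B', 'C', 'D')]
```

The join produces (A B C D) and (A C D E); the prune drops (A C D E) without touching the database. Surviving candidates still need a pass over the transactions to count their actual support.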
48From Frequent Itemsets to Association Rules
- Q: Given frequent set {A,B,E}, what are possible association rules?
- A => B, E
- A, B => E
- A, E => B
- B => A, E
- B, E => A
- E => A, B
- __ => A,B,E (empty rule), or true => A,B,E
KDnuggets
49Example: Generating Rules from an Itemset
- Frequent itemset from golf data:
Humidity = Normal, Windy = False, Play = Yes (4)
- Seven potential rules:
Rule                                                               Conf.
If Humidity = Normal and Windy = False then Play = Yes             4/4
If Humidity = Normal and Play = Yes then Windy = False             4/6
If Windy = False and Play = Yes then Humidity = Normal             4/6
If Humidity = Normal then Windy = False and Play = Yes             4/7
If Windy = False then Humidity = Normal and Play = Yes             4/8
If Play = Yes then Humidity = Normal and Windy = False             4/9
If True then Humidity = Normal and Windy = False and Play = Yes    4/12
KDnuggets
50Example: Generating Rules
- Rules with support > 1 and confidence = 100%
- In total: 3 rules with support four, 5 with support three, and 50 with support two
No.  Association rule                                     Sup.  Conf.
1    Humidity=Normal Windy=False => Play=Yes              4     100%
2    Temperature=Cool => Humidity=Normal                  4     100%
3    Outlook=Overcast => Play=Yes                         4     100%
4    Temperature=Cold Play=Yes => Humidity=Normal         3     100%
...  ...                                                  ...   ...
58   Outlook=Sunny Temperature=Hot => Humidity=High       2     100%
KDnuggets
51Weka associations output
KDnuggets
52Extensions and Challenges
- Extra quality measure: Lift
- The lift of an association rule I => J is defined as:
- $\mathrm{lift} = P(J \mid I) / P(J)$
- Note, $P(I)$ = (support of I) / (no. of transactions)
- ratio of confidence to expected confidence
- Interpretation:
- if lift > 1, then I and J are positively correlated
- if lift < 1, then I and J are negatively correlated
- if lift = 1, then I and J are independent
- Other measures for interestingness
- A => B, B => C, but not A => C
- Efficient algorithms
- Known Problem
- What to do with all these rules? How to exploit / make useful / actionable?
KDnuggets
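Lift can be computed from three counts and the number of transactions. The sketch below is illustrative: the nine transactions are an assumed toy database, not data from the lecture, and the function name `lift` is invented.

```python
def lift(i, j, transactions):
    """lift(I => J) = P(J | I) / P(J), i.e. confidence divided by expected confidence."""
    n = len(transactions)
    count_both = sum((i | j) <= t for t in transactions)
    count_i = sum(i <= t for t in transactions)
    count_j = sum(j <= t for t in transactions)
    confidence = count_both / count_i
    expected_confidence = count_j / n
    return confidence / expected_confidence

# Assumed toy transactions.
transactions = [
    {"A", "B", "E"}, {"B", "D"}, {"B", "C"}, {"A", "B", "D"}, {"A", "C"},
    {"B", "C"}, {"A", "C"}, {"A", "B", "C", "E"}, {"A", "B", "C"},
]

print(lift({"A", "B"}, {"E"}, transactions))   # above 1: positively correlated
print(lift({"A"}, {"C"}, transactions))        # about 1: independent
print(lift({"B"}, {"C"}, transactions))        # below 1: negatively correlated
```

A rule can have high confidence and still have a lift near 1 when its consequent is frequent anyway, which is why lift is useful next to support and confidence.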
53Biomedical Application: Head and Neck Cancer Example
- 1. ace270 fiveyralive 381 => tumorbefore0 372 conf(0.98)
- 2. genderM ace270 467 => tumorbefore0 455 conf(0.97)
- 3. ace270 588 => tumorbefore0 572 conf(0.97)
- 4. tnmT0N0M0 ace270 405 => tumorbefore0 391 conf(0.97)
- 5. locLOC7 tumorbefore0 409 => tnmT0N0M0 391 conf(0.96)
- 6. locLOC7 442 => tnmT0N0M0 422 conf(0.95)
- 7. locLOC7 genderM tumorbefore0 374 => tnmT0N0M0 357 conf(0.95)
- 8. locLOC7 genderM 406 => tnmT0N0M0 387 conf(0.95)
- 9. genderM fiveyralive 633 => tumorbefore0 595 conf(0.94)
- 10. fiveyralive 778 => tumorbefore0 726 conf(0.93)
54Bioinformatics Application
- The idea of association rules has been customized for bioinformatics applications
- In biology it is often interesting to find frequent structures rather than items
- For instance protein or other chemical structures
- Solution: Mining Frequent Patterns
- FSG (Kuramochi and Karypis, ICDM 2001)
- gSpan (Yan and Han, ICDM 2002)
- CloseGraph (Yan and Han, KDD 2003)
55FSG: Mining Frequent Patterns
56FSG: Mining Frequent Patterns
57FSG Algorithm for finding frequent subgraphs
58Frequent Subgraph Examples: AIDS Data
- Compounds are active, inactive or moderately
active (CA, CI, CM)
59Predictive Subgraphs
- The three most discriminating sub-structures for the PTC, AIDS, and Anthrax datasets
60Experiments and Results: AIDS Data
61FSG References
- Frequent Sub-structure Based Approaches for Classifying Chemical Compounds. Mukund Deshpande, Michihiro Kuramochi, and George Karypis. ICDM 2003
- An Efficient Algorithm for Discovering Frequent Subgraphs. Michihiro Kuramochi and George Karypis. IEEE TKDE
- Automated Approaches for Classifying Structures. Mukund Deshpande, Michihiro Kuramochi, and George Karypis. BIOKDD 2002
- Discovering Frequent Geometric Subgraphs. Michihiro Kuramochi and George Karypis. ICDM 2002
- Frequent Subgraph Discovery. Michihiro Kuramochi and George Karypis. 1st IEEE Conference on Data Mining 2001
62Recap
- Before Starting to Mine.
- Descriptive Data Mining
- Dimension Reduction / Projection
- Clustering
- Hierarchical clustering
- K-means
- Self organizing maps
- Association rules
- Frequent item sets
- Association Rules
- APRIORI
- Bio-informatics case: FSG for frequent subgraph discovery
- Next week
- Bioinformatics Data Mining Cases / Lab Session / Take Home Exercise