1 NDSU C.S. 783 Parallel and Vertical High Performance Software Systems
Paper Topic 7: Parallel Clustering with Vertical Data
By Dr. William Perrizo and Dr. Gregory Wettstein
2 - Data Mining is one aspect of Database Query Processing (on the "what if" or pattern-and-trend end of Query Processing, rather than the "please find" or straightforward end).
- To say it another way, data mining queries are on the ad hoc or unstructured end of the query spectrum, rather than on the standard report generation or "retrieve all records matching a criterion" or SQL side.
- Still, Data Mining queries ARE queries and are processed (or will eventually be processed) by a Database Management System the same way queries are processed today, namely:
- 1. SCAN and PARSE (SCANNER-PARSER): A Scanner identifies the tokens or language elements of the DM query. The Parser checks for syntax or grammar validity.
- 2. VALIDATE: The Validator checks for valid names and semantic correctness.
- 3. CONVERT: The Converter converts the query to an internal representation.
- 4. QUERY OPTIMIZE: The Optimizer devises a strategy for executing the DM query (chooses among alternative internal representations of the query).
- 5. CODE GENERATION: generates code to implement each operator in the selected DM query plan (the optimizer-selected internal representation).
- 6. RUNTIME DATABASE PROCESSING: run the plan code.
- Developing new, efficient and effective Data Mining Query (DMQ) processors is the central need and issue in DBMS research today (far and away!).
- These notes concentrate on step 5, i.e., generating code (algorithms) to implement operators (at a high level), namely operators that do Association Rule Mining (ARM), Clustering (CLU), and Classification (CLA).
Database Analysis consists of Querying and Data Mining. Data Mining can be broken down into 2 areas: Machine Learning and Association Rule Mining. Machine Learning can be broken down into 2 areas: Clustering and Classification. Clustering can be broken down into 2 types: Isotropic (round clusters) and Density-based. Classification can be broken down into 2 types: Model-based and Neighbor-based.
3 Clustering Methods
- Clustering is partitioning into mutually exclusive and collectively exhaustive subsets, such that each point is
- very similar to (close to) the other points in its component and
- very dissimilar to (far from) the points in the other components.
- A Categorization of Major Clustering Methods:
- Partitioning methods (K-means, K-medoids, ...)
- Hierarchical methods (Agnes, Diana, ...)
- Density-based methods
- Grid-based methods
- Model-based methods
4 The K-Means Clustering Method
- Given k, the k-means algorithm is implemented in 4 steps (assumes the partitioning criterion is: maximize intra-cluster similarity and minimize inter-cluster similarity. Of course, a heuristic is used; the method isn't really an optimization).
- Partition objects into k nonempty subsets (or pick k initial means).
- Compute the mean (center) or centroid of each cluster of the current partition (if one started with k means, then this step is done).
- Centroid = the point that minimizes the sum of dissimilarities from the mean or the sum of the square errors from the mean.
- Assign each object to the cluster with the most similar (closest) center.
- Go back to Step 2.
- Stop when the new set of means doesn't change (or some other stopping condition?).
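The steps above can be sketched in plain Python. This is a minimal illustration of ordinary (horizontal) Lloyd-style k-means, not the vertical implementation these notes build toward; the helper structure and names are ours:

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Lloyd's k-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Step 1: pick k initial means (here: k random distinct points).
    means = [list(p) for p in rng.sample(points, k)]
    assign = None
    for _ in range(max_iter):
        # Step 3: assign each object to the most similar (closest) center.
        new_assign = [
            min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(p, means[j])))
            for p in points
        ]
        # Stop when the assignment (hence the set of means) no longer changes.
        if new_assign == assign:
            break
        assign = new_assign
        # Step 2: recompute the centroid (mean) of each cluster.
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:
                means[j] = [sum(coord) / len(members) for coord in zip(*members)]
    return means, assign
```

On two well-separated groups of points, the returned assignment splits them into the two obvious clusters.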
5 k-Means Clustering animation
[Figure: animation of k-means Steps 1-4 on a 10x10 scatter of points]
Strength: relatively efficient: O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations. Normally k, t << n.
Weakness: applicable only when the mean is defined (e.g., a vector space). Need to specify k, the number of clusters, in advance. It is sensitive to noisy data and outliers.
6 The K-Medoids Clustering Method
- Find representative objects, called medoids (a medoid must be an actual object in the cluster, whereas the mean seldom is).
- PAM (Partitioning Around Medoids, 1987)
- starts from an initial set of medoids
- iteratively replaces one of the medoids by a non-medoid
- if the swap improves the aggregate similarity measure, retain the swap. Do this over all medoid/non-medoid pairs.
- PAM works for small data sets but does not scale to large data sets.
- CLARA (Clustering LARge Applications) (Kaufmann and Rousseeuw, 1990): sub-samples, then applies PAM.
- CLARANS (Clustering Large Applications based on RANdomized Search) (Ng and Han, 1994): randomizes the sampling.
7 Hierarchical Clustering Methods: AGNES (Agglomerative Nesting)
- Introduced in Kaufmann and Rousseeuw (1990)
- Uses the Single-Link method (the distance between two sets is the minimum pairwise distance).
- Other options are complete link (distance is the maximum pairwise distance), average link, ...
- Merge the nodes that are most similar.
- Eventually all nodes belong to the same cluster.
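The single-link merge loop above can be sketched naively (recomputing all pairwise cluster distances each round, so roughly cubic; the bookkeeping and names are ours):

```python
def agnes_single_link(points, dist, k_stop=1):
    """Agglomerative clustering with single-link: repeatedly merge the two
    clusters whose minimum pairwise distance is smallest, until k_stop remain."""
    clusters = [[p] for p in points]            # start: every point is its own cluster
    merges = []
    while len(clusters) > k_stop:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single-link distance between two clusters = min pairwise distance.
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((d, list(clusters[i]), list(clusters[j])))
        clusters[i] = clusters[i] + clusters.pop(j)
    return clusters, merges
```

Running with `k_stop=1` merges everything into one cluster (the AGNES endpoint); the `merges` list records the dendrogram levels.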
8 DIANA (Divisive Analysis)
- Introduced in Kaufmann and Rousseeuw (1990)
- Inverse order of AGNES: initially all objects are in one cluster, which is then split according to some criterion (e.g., again, maximize some aggregate measure of pairwise dissimilarity).
- Eventually each node forms a cluster on its own.
9 Contrasting Hierarchical Clustering Techniques
- Hierarchical algorithms create a hierarchical decomposition into ever-finer partitions,
- e.g., top down (divisively) or bottom up (agglomeratively).
10 Hierarchical Clustering
11 Hierarchical Clustering (top down)
- In either case, one gets a nice dendrogram in which any maximal anti-chain (no 2 nodes linked) is a clustering (partition).
12 Hierarchical Clustering (Cont.)
Recall that any maximal anti-chain (maximal set of nodes in which no 2 are chained) is a clustering (a dendrogram offers many).
13 Hierarchical Clustering (Cont.)
But the horizontal anti-chains are the clusterings resulting from the top-down (or bottom-up) method(s).
14 Data Mining Summary
Data Mining on a given table of data includes:
Association Rule Mining (ARM) on Bipartite Relationships;
Clustering: Partitioning methods (K-means, K-medoids, ...), Hierarchical methods (Agnes, Diana, ...), Model-based methods (K-Means, K-Medoids, ...), ...;
Classification: Decision Tree Induction, Bayesian, Neural Network, k-Nearest-Neighbor, ...
But most data mining is on a database, not just one table; that is, oftentimes one must first apply the appropriate SQL query to a database to get the table to be data mined. The next slides discuss vertical data methods for doing that. You may wish to skip this material if not interested in the topic.
15 APPENDIX-1: Data Mining DUALITIES
1. PARTITION ↔ FUNCTION ↔ EQUIVALENCE RELATION ↔ UNDIRECTED GRAPH
- Given any set, S: A Partition is a decomposition of a set into subsets which are mutually exclusive (non-overlapping) and collectively exhaustive (every set point is in one subset). We assume the partition has unique labels on each of its component subsets (required for unambiguous reference).
- An Equivalence Relation equivocates pairs of S-points, x ≡ y, such that x ≡ x; x ≡ y ⟹ y ≡ x; and x ≡ y and y ≡ z ⟹ x ≡ z.
- A Partition of S
- Induces the Function which takes each point in the set to the label of its component,
- Induces the Equivalence Relation which equates two points iff they are in the same component,
- Induces the Undirected Graph with an edge connecting each S-pair from the same component.
- A Function, f, from S to R
- Induces the Partition into the pre-image components of its R-points, {f⁻¹(r) | r ∈ R},
- Induces the Equivalence Relation which equates x ≡ y iff f(x) = f(y) = r (labeling that component as r),
- Induces the Undirected Graph with an edge connecting x with y iff f(x) = f(y).
- An Equivalence Relation on S, x ≡ y
- Induces the Partition into its equivalence classes, x_comp = {y ∈ S | y ≡ x} (the x's serving as canonical component representatives),
- Induces the function, f(y) = x iff y ∈ x_comp,
- Induces the Undirected Graph with an edge connecting x with y iff x ≡ y.
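The function-side of the duality above can be made concrete: from a single function we derive the induced partition, equivalence relation, and undirected graph. A small verification sketch (function names are ours):

```python
from collections import defaultdict

def partition_from_function(S, f):
    """The induced partition: pre-image components f^-1(r), labeled by r."""
    comps = defaultdict(set)
    for x in S:
        comps[f(x)].add(x)
    return dict(comps)

def equivalent(f, x, y):
    """The induced equivalence relation: x = y iff f(x) == f(y)."""
    return f(x) == f(y)

def graph_from_function(S, f):
    """The induced undirected graph: edge {x, y} iff f(x) == f(y), x != y."""
    return {frozenset((x, y)) for x in S for y in S if x != y and f(x) == f(y)}

# Example: S = {0..5} with f = parity induces the even/odd partition.
S = range(6)
parity = lambda n: n % 2
```

With this example, the partition is {0: evens, 1: odds}, mutually exclusive and collectively exhaustive, and the graph has an edge within each component for every pair.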
2. PARTIALLY ORDERED SET ↔ CLOSED DIRECTED ACYCLIC GRAPH
A Closed Directed Acyclic Graph on S induces the Partially Ordered Set containing (x, y) iff there is an edge from x to y in the graph. A Partial Ordering, ≤, on S induces the Directed Acyclic Graph with an edge running from x to y iff x ≤ y.
16 RoloDex Model
The dualities on the previous slide apply to unary relationships (relationships in which there is one entity and we are relating pairs of instances from that entity). The relationships we had been talking about were all Bipartite Relationships, in which we have two separate entities (or two disjoint subsets from the same entity), and we only relate x to y if x and y are instances from different entities. Often, we need to analyze an even more complex situation in which we have a combination of bipartite and unipartite relationships, called:
- Bipartite-Unipartite on Part (BUP) relationships.
- Examples:
- In Bioinformatics, bipartite relationships between genes and experiments (a gene is related to an experiment iff it expresses at a threshold level in that experiment) are studied in conjunction with unipartite relationships on gene pairs (e.g., gene-gene or protein-protein interactions).
- In Market Research, bipartite relationships between items and customers are studied in conjunction with unipartite relationships on the customers (e.g., x ≡ y iff x and y are males under 25).
- For these BUP situations, we suggest the RoloDex Model. In this model, each relationship is expressed as a "card" in a rolodex revolving around the entities involved in that relationship.
17 The Bipartite, Unipartite-on-Part (BUP), Experiment-Gene-Gene Relationship, EGG
So as not to duplicate axes, this copy of the Genes axis should ideally be folded over to coincide with the other copy, producing a "conical" unipartite card.
Each conical RoloDex card revolving about the Gene entity axis is a separate Gene-Gene (or Protein-Protein) interaction relationship.
Each rectangular RoloDex card revolving about the Gene entity axis and about the Experiment axis is a separate Gene-Experiment relationship.
[Figure: two G=Genes axes (ticks 1-4) and an E=Experiments axis, with conical and rectangular cards between them]
18 RoloDex Model
∀ Axis-Card pair (Entity-Relationship pair), a→c(a,b), ∃ a support count (or ratio or %) for AxisSets A: for a graph relationship, suppG∃(A, a→c(a,b)) = |{b | ∃a∈A, (a,b)∈c}|, and for a multigraph, suppMG∃ is the histogram over b of (a,b)-EdgeCounts, a∈A. Other quantifiers can be used also (e.g., the universal, ∀, is used in MBR).
Conf(A→B) = Supp(A∪B) / Supp(A)
Supp(A) = CusFreq(ItemSet)
Most interestingness measures are based on one of these supports. In IR, df(t) = suppG∃(t, t→c(t,d)); tf(t,d) is the one histogram bar in suppMG∃(t, t→c(t,d)). In MBR, supp(I) = suppG∀(I, i→c(i,t)). In MDA, suppMG∃(GSet, g→c(g,e)). Of course, all supports are inherited redundantly by the card, c(a,b).
[Figure: RoloDex diagram with cards: cust-item, term-doc, author-doc, term-term (share stem?), doc-doc, gene-gene (ppi), exp-gene, exp-PI, around People and other entity axes]
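With a card stored as a set of (a, b) pairs, the graph support in both quantifier variants and the Conf(A→B) = Supp(A∪B)/Supp(A) definition above can be computed directly (a small sketch; function names are ours):

```python
def supp_exists(A, card):
    """Existential graph support: |{b : some a in A has (a, b) in the card}|."""
    return len({b for (a, b) in card if a in A})

def supp_forall(A, card, axis_b):
    """Universal-quantifier support (as in market-basket ARM):
    |{b : (a, b) in the card for EVERY a in A}|."""
    return len({b for b in axis_b if all((a, b) in card for a in A)})

def conf(A, B, card, axis_b):
    """Conf(A -> B) = Supp(A u B) / Supp(A), universal-quantifier version."""
    return supp_forall(A | B, card, axis_b) / supp_forall(A, card, axis_b)
```

For instance, on a cust-item style card {('x',1), ('x',2), ('y',2), ('y',3)} with b-axis {1, 2, 3}, the existential support of {'x'} is 2, the universal support of {'x','y'} is 1, and conf({'x'}→{'y'}) is 0.5.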
19 Cousin Association Rules (CARs)
- ∀ card (RELATIONSHIP), c(I,T), one has:
- I-Association Rules among disjoint Isets: A→C, ∀ A,C ⊆ I, with A∩C = Ø, and
- T-Association Rules among disjoint Tsets: A→C, ∀ A,C ⊆ T, with A∩C = Ø.
- Two measures of quality of A→C are:
SUPP(A→C), where, e.g., for any Iset A, SUPP(A) = |{t | (i,t)∈E ∀i∈A}|, and
CONF(A→C) = SUPP(A∪C) / SUPP(A).
- First Cousin Association Rules:
- Given any card, c(T,U), sharing axis T with the bipartite relationship b(T,I),
- Cousin Association Rules are those in which the antecedent Tset is generated by a subset, S, of U as follows: {t∈T | ∃u∈S such that (t,u)∈C}. (Note this should be called an "existential first cousin AR" since we are using the existential quantifier. One can use the universal quantifier, as was used in MBR ARM.)
- E.g., S ⊆ U, A = C(S), A' ⊆ T; then A→A' is a CAR and we can also label it S→A'.
- First Cousin Association Rules once removed (FCAR-1r) are those in which both Tsets are generated by another bipartite relationship, and we can label the antecedent and/or the consequent using the generating set or the Tset.
20 The Cousin Association Rules
- Second Cousin Association Rules are those in which the antecedent Tset is generated by a subset of an axis which shares a card with T, which shares the card, B, with I. 2CARs can be denoted using the generating (second cousin) set or the Tset antecedent.
Second Cousin Association Rules once removed are those in which the antecedent Tset is generated by a subset of an axis which shares a card with T, which shares the card, B, with I, and the consequent is generated by C(T,U) (a first cousin Tset). 2CAR-1rs can be denoted using any combination of the generating (second cousin) set or the Tset antecedent and the generating (first cousin) set or Tset consequent.
Second Cousin Association Rules twice removed are those in which the antecedent Tset is generated by a subset of an axis which shares a card with T, which shares the card, B, with I, and the consequent is generated by a subset of an axis which shares a card with T, which shares another first cousin card with I. 2CAR-2rs can be denoted using any combination of the generating (second cousin) set or the Tset antecedent and the generating (second cousin) set or Tset consequent. Note: 2CAR-2rs are also 2CAR-1rs, so they can be denoted as above also.
Third Cousin Association Rules are those.... We note that these definitions give us many opportunities to define quality measures.
21 Measuring Quality in the RoloDex Model
For Distance CARMA relationships, quality (e.g., supp or conf or ???) can be measured using information on any/all cards along the relationship (multiple cards can contribute factors or terms, or contribute in some other way???).
[Figure: the same RoloDex diagram as before: cust-item, term-doc, author-doc, term-term (share stem?), doc-doc, gene-gene (ppi), exp-gene, exp-PI cards around People and other entity axes]
22 Generalized CARs
First, we propose a definition of Generalized Association Rules (GARs) which contains the standard "1 Entity Itemset" AR definition as a special case. Association Pathway Mining (APM) is a DM technique (with application to bioinformatics?). Given relationships R1, R2 (RoloDex cards) with shared entity F (axis), E←R1→F←R2→G, and given A⊆E and C⊆G, then A→C is a Generalized F Association Rule, with
Support_{R1,R2}(A→C) = |{t∈F | ∃a∈A, (a,t)∈R1 and ∃c∈C, (c,t)∈R2}|
Confidence_{R1,R2}(A→C) = Support_{R1,R2}(A→C) / Support_{R1}(A), where, as always, Support_{R1}(A) = |{t∈F | ∃a∈A, (a,t)∈R1}|.
E.g., the GAR is a standard AR iff E = G (with A∩C = Ø). Association Pathway Mining (APM) is the identification and assessment (e.g., support, confidence, etc.) of chains of GARs in a RoloDex. Restricting to the mining of cousin GARs reduces the number of strong rules or pathway links.
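The Support and Confidence formulas above translate directly when each card is stored as a set of pairs (existential form, matching the formula on this slide; function names are ours):

```python
def gar_support(A, C, R1, R2, F):
    """Support_{R1,R2}(A -> C): count of t in F related (in R1) to some a in A
    and (in R2) to some c in C."""
    return len({t for t in F
                if any((a, t) in R1 for a in A)
                and any((c, t) in R2 for c in C)})

def gar_confidence(A, C, R1, R2, F):
    """Confidence_{R1,R2}(A -> C) = Support_{R1,R2}(A -> C) / Support_{R1}(A)."""
    supp_A = len({t for t in F if any((a, t) in R1 for a in A)})
    return gar_support(A, C, R1, R2, F) / supp_A
```

For example, with F = {1, 2, 3}, R1 = {('a',1), ('a',2)}, R2 = {('c',2), ('c',3)}, only t = 2 is reached from both sides, so support is 1 and confidence is 1/2.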
23 More generally, for entities E, F, G and relationships R1(E,F) and R2(F,G), with A ⊆ E←R1→F←R2→G ⊇ C:
Support-Set_{R1,R2}(A,C) = SS_{R1,R2}(A,C) = {t∈F | ∀a∈A (a,t)∈R1, ∀c∈C (c,t)∈R2}
If F has real labels, Label-Weighted-Support_{R1,R2}(A,C) = LWS_{R1,R2}(A,C) = Σ_{t∈SS_{R1,R2}} label(t) (the un-generalized case occurs when all weights are 1).
Downward closure property of Support Sets: SS(A',C') ⊇ SS(A,C) ∀A'⊆A, C'⊆C. Therefore, if all labels are non-negative, then LWS(A,C) ≤ LWS(A',C') (for LWS(A,C) to exceed a threshold, it is necessary that all LWS(A',C') exceed that threshold, ∀A'⊆A, C'⊆C). So an Apriori-like frequent set-pair miner would go as follows: start with pairs of 1-sets (in E and G). The only candidate 2-antecedents with 1-consequents (equivalently, 2-consequents with 1-antecedents) would be those formed by joining ... The weighted support concept can be extended to the case where R1 and/or R2 have labels as well. Vertical methods can be applied by converting F to vertical format (F instances are the rows, and pertinent features from other cards/axes are "rolled over" to F as derived feature attributes).
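The support-set, LWS, and the downward-closure claim can be checked mechanically in the universal form with non-negative labels (a small sketch; the check harness and names are ours):

```python
from itertools import chain, combinations

def support_set(A, C, R1, R2, F):
    """SS(A, C): t in F related (R1) to EVERY a in A and (R2) to EVERY c in C.
    The universal quantifier is what makes the measure downward closed."""
    return {t for t in F
            if all((a, t) in R1 for a in A) and all((c, t) in R2 for c in C)}

def lws(A, C, R1, R2, F, label):
    """Label-Weighted Support: sum of t's labels over the support set."""
    return sum(label[t] for t in support_set(A, C, R1, R2, F))

def subsets(s):
    """All subsets of s, as sets (the A', C' of the closure property)."""
    return [set(c) for c in chain.from_iterable(
        combinations(sorted(s), r) for r in range(len(s) + 1))]
```

Enumerating every A'⊆A and C'⊆C on a small instance confirms SS(A',C') ⊇ SS(A,C) and LWS(A',C') ≥ LWS(A,C), which is the pruning condition an Apriori-like set-pair miner relies on.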
[Figure: E←R1→F←R2→G diagram with labeled F-instances (l2,2, l2,3), a further relationship R3, and sets A ⊆ E, C ⊆ G]
24 VERTIGO: A Vertically Structured Rendition of the GO (Gene Ontology)?
How do we include GO data in the Data Mining processes?
1. Treat it as a horizontally structured dataset.
2. View GO as a Gene Set hierarchy (that seems to be how it is used, often?) with the other aspects of it as node labels. One could then minimize it as a subset of the Set Enumeration Tree with highly structured labels?
3. Preprocess pertinent GO information into derived attributes on a Gene Table.
4. Use the RoloDex Model for it? Preliminary thoughts on this alternative include: each of the three major annotation areas (Molecular Function, Cellular Component, Biological Process) is a Gene-to-Annotation Card.
25 Thank you.