1
NDSU C.S. 783: Parallel and Vertical High Performance Software Systems
Paper_Topic_7: Parallel Clustering with Vertical Data
By Dr. William Perrizo and Dr. Gregory Wettstein
2
  • Data Mining is one aspect of Database Query
    Processing (on the "what if" or pattern-and-trend
    end of Query Processing, rather than the "please
    find" or straightforward end).
  • To say it another way, data mining queries are on
    the ad hoc or unstructured end of the query
    spectrum, rather than the standard report
    generation or "retrieve all records matching a
    criterion" SQL side.
  • Still, Data Mining queries ARE queries and are
    processed (or will eventually be processed) by a
    Database Management System the same way queries
    are processed today, namely:
  • 1. SCAN and PARSE (SCANNER-PARSER): a Scanner
    identifies the tokens or language elements of the
    DM query; the Parser checks for syntax or grammar
    validity.
  • 2. VALIDATE: the Validator checks for valid
    names and semantic correctness.
  • 3. CONVERT: the Converter converts the query to
    an internal representation.
  • 4. QUERY OPTIMIZE: the Optimizer devises a
    strategy for executing the DM query (chooses among
    alternative internal representations).
  • 5. CODE GENERATION: generates code to implement
    each operator in the selected DM query plan (the
    optimizer-selected internal representation).
  • 6. RUNTIME DATABASE PROCESSING: run the plan code.
  • Developing new, efficient and effective
    DataMining Query (DMQ) processors is the central
    need and issue in DBMS research today (far and
    away!).
  • These notes concentrate on 5, i.e., generating
    code (algorithms) to implement operators (at a
    high level), namely operators that do
    Association Rule Mining (ARM), Clustering (CLU),
    and Classification (CLA).

Database Analysis consists of Querying and Data
Mining. Data Mining can be broken down into 2
areas: Machine Learning and Association Rule
Mining. Machine Learning can be broken down into
2 areas: Clustering and Classification. Clustering
can be broken down into 2 types: Isotropic (round
clusters) and Density-based. Classification can be
broken down into 2 types: Model-based and
Neighbor-based.
3
Clustering Methods
  • Clustering is partitioning into mutually
    exclusive and collectively exhaustive subsets,
    such that each point is
  • very similar to (close to) the other points in
    its component, and
  • very dissimilar to (far from) the points in the
    other components.
  • A Categorization of Major Clustering Methods:
  • Partitioning methods (K-means, K-medoids, ...)
  • Hierarchical methods (AGNES, DIANA, ...)
  • Density-based methods
  • Grid-based methods
  • Model-based methods

4
The K-Means Clustering Method
  • Given k, the k-means algorithm is implemented in
    4 steps (this assumes the partitioning criterion is
    to maximize intra-cluster similarity and minimize
    inter-cluster similarity; of course, a heuristic
    is used, and the method isn't really an optimization):
  • 1. Partition the objects into k nonempty subsets
    (or pick k initial means).
  • 2. Compute the mean (center) or centroid of each
    cluster of the current partition (if one started
    with k means, this step is already done). The
    centroid is the point that minimizes the sum of
    dissimilarities, or the sum of the square errors,
    from the points of the cluster.
  • 3. Assign each object to the cluster with the most
    similar (closest) center.
  • 4. Go back to Step 2.
  • Stop when the new set of means doesn't change
    (or some other stopping condition?). A minimal
    sketch of these steps appears below.
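A minimal, illustrative sketch of these 4 steps (pure Python, assuming points are numeric tuples and Euclidean distance; a library routine such as sklearn.cluster.KMeans would normally be used instead):

    import math
    import random

    def kmeans(points, k, max_iters=100):
        # Step 1: pick k initial means at random.
        means = random.sample(points, k)
        for _ in range(max_iters):
            # Step 3: assign each point to the most similar (closest) mean.
            clusters = [[] for _ in range(k)]
            for p in points:
                j = min(range(k), key=lambda i: math.dist(p, means[i]))
                clusters[j].append(p)
            # Step 2: recompute the mean (centroid) of each cluster.
            new_means = [tuple(sum(xs) / len(cl) for xs in zip(*cl)) if cl
                         else means[i] for i, cl in enumerate(clusters)]
            # Step 4: go back, stopping when the means no longer change.
            if new_means == means:
                break
            means = new_means
        return means, clusters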

5
k-Means Clustering animation
[Animation figure: Steps 1 through 4 of k-means shown on a 10 x 10 grid]
Strength: relatively efficient: O(tkn), where n is
the number of objects, k the number of clusters,
and t the number of iterations; normally k, t << n.
Weakness: applicable only when the mean is defined
(e.g., a vector space); the number of clusters, k,
must be specified in advance; sensitive to noisy
data and outliers.
6
The K-Medoids Clustering Method
  • Find representative objects, called medoids (a
    medoid must be an actual object in the cluster,
    whereas the mean seldom is).
  • PAM (Partitioning Around Medoids, 1987):
  • starts from an initial set of medoids;
  • iteratively replaces one of the medoids by a
    non-medoid;
  • if the swap improves the aggregate similarity
    measure, retains the swap. Do this over all
    medoid/non-medoid pairs (a sketch appears below).
  • PAM works for small data sets but does not scale
    to large data sets.
  • CLARA (Clustering LARge Applications)
    (Kaufmann and Rousseeuw, 1990): sub-samples, then
    applies PAM.
  • CLARANS (Clustering Large Applications based on
    RANdom Search) (Ng and Han, 1994): randomizes the
    sampling.
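A minimal sketch of the PAM swap loop (illustrative Python, assuming a user-supplied dissimilarity function d(x, y); real use would rely on a library implementation such as scikit-learn-extra's KMedoids):

    import random

    def total_dissimilarity(points, medoids, d):
        # Each point contributes its dissimilarity to the nearest medoid.
        return sum(min(d(p, m) for m in medoids) for p in points)

    def pam(points, k, d):
        # Start from an initial set of k medoids.
        medoids = random.sample(points, k)
        best = total_dissimilarity(points, medoids, d)
        improved = True
        while improved:
            improved = False
            # Consider every medoid / non-medoid swap...
            for i in range(k):
                for p in points:
                    if p in medoids:
                        continue
                    trial = medoids[:i] + [p] + medoids[i + 1:]
                    cost = total_dissimilarity(points, trial, d)
                    # ...and retain a swap iff it improves the measure.
                    if cost < best:
                        medoids, best, improved = trial, cost, True
        return medoids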

7
Hierarchical Clustering Methods: AGNES
(Agglomerative Nesting)
  • Introduced in Kaufmann and Rousseeuw (1990)
  • Uses the Single-Link method (the distance between
    two sets is the minimum pairwise distance).
  • Other options are complete link (distance is the
    max pairwise distance), average link, ...
  • Merge the nodes that are most similar (a sketch
    appears below).
  • Eventually all nodes belong to the same cluster.
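A minimal single-link AGNES sketch (illustrative, assuming Euclidean points; scipy.cluster.hierarchy.linkage with method='single' is the usual tool):

    import math

    def agnes_single_link(points):
        clusters = [[p] for p in points]     # start: every point is a cluster
        merges = []                          # the dendrogram, as a merge log
        while len(clusters) > 1:
            # Single-link: the distance between two sets is the minimum
            # pairwise distance; find the most similar pair of clusters.
            best = None
            for i in range(len(clusters)):
                for j in range(i + 1, len(clusters)):
                    dist = min(math.dist(a, b)
                               for a in clusters[i] for b in clusters[j])
                    if best is None or dist < best[0]:
                        best = (dist, i, j)
            dist, i, j = best
            merges.append((list(clusters[i]), list(clusters[j]), dist))
            # Merge the pair; eventually all nodes are in one cluster.
            clusters[i] = clusters[i] + clusters[j]
            del clusters[j]
        return merges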

8
DIANA (Divisive Analysis)
  • Introduced in Kaufmann and Rousseeuw (1990)
  • Inverse order of AGNES: initially all objects are
    in one cluster, which is then split according to
    some criterion (e.g., again, maximize some
    aggregate measure of pairwise dissimilarity); a
    sketch of one split appears below.
  • Eventually each node forms a cluster on its own.
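A rough sketch of one DIANA-style split (illustrative, assuming Euclidean points): the point of largest average dissimilarity seeds a "splinter" group, and points closer on average to the splinter than to the remainder move over; applied recursively, each object eventually forms its own cluster.

    import math

    def avg_dissim(p, group):
        return (sum(math.dist(p, q) for q in group) / len(group)) if group else 0.0

    def diana_split(cluster):
        # Seed the splinter group with the most dissimilar point.
        seed = max(cluster,
                   key=lambda p: avg_dissim(p, [q for q in cluster if q != p]))
        splinter = [seed]
        rest = [p for p in cluster if p != seed]
        moved = True
        while moved:
            moved = False
            for p in list(rest):
                # Move p if it is closer (on average) to the splinter group.
                if avg_dissim(p, splinter) < avg_dissim(p, [q for q in rest if q != p]):
                    rest.remove(p)
                    splinter.append(p)
                    moved = True
        return splinter, rest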

9
Contrasting Hierarchical Clustering Techniques
  • Hierarchical algorithms create a hierarchical
    decomposition into ever-finer partitions:
  • top down (divisive), or
  • bottom up (agglomerative).
10
Hierarchical Clustering
11
Hierarchical Clustering (top down)
  • In either case, one gets a nice dendrogram in
    which any maximal anti-chain (no 2 nodes linked)
    is a clustering (partition).

12
Hierarchical Clustering (Cont.)
Recall that any maximal anti-chain (a maximal set
of nodes in which no 2 are chained) is a
clustering (a dendrogram offers many).
13
Hierarchical Clustering (Cont.)
But the horizontal anti-chains are the clusterings
resulting from the top-down (or bottom-up)
method(s); a small illustration follows.
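A small illustration (assuming scipy is available): each horizontal cut of a dendrogram selects a maximal anti-chain, i.e., one clustering; different cut heights give different partitions.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    points = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [10, 0]])
    Z = linkage(points, method='single')   # bottom-up (agglomerative) merge tree
    # Each cut height t induces one horizontal anti-chain / partition:
    for t in (1.5, 6.0):
        print(t, fcluster(Z, t=t, criterion='distance'))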
14
Data Mining Summary
Data Mining on a given table of data includes:
Association Rule Mining (ARM) on bipartite
relationships; Clustering: partitioning methods
(K-means, K-medoids, ...), hierarchical methods
(AGNES, DIANA, ...), model-based methods, ...;
Classification: Decision Tree Induction, Bayesian,
Neural Network, k-Nearest-Neighbor, ...
But most data mining is on a database, not just
one table; that is, often one must first apply the
appropriate SQL query to a database to get the
table to be data mined. The next slides discuss
vertical data methods for doing that. You may wish
to skip this material if not interested in the
topic.
15
APPENDIX-1: Data Mining DUALITIES
1. PARTITION ↔ FUNCTION ↔ EQUIVALENCE RELATION ↔ UNDIRECTED GRAPH
  • Given any set, S: a Partition is a decomposition
    of the set into subsets which are mutually
    exclusive (non-overlapping) and collectively
    exhaustive (every point of the set is in exactly
    one subset). We assume the partition has unique
    labels on each of its component subsets (required
    for unambiguous reference).
  • An Equivalence Relation, ~, equivocates pairs of
    S-points, x, y, such that x~x; x~y ⟹ y~x; x~y and
    y~z ⟹ x~z.
  • A Partition of S
  • induces the Function which takes each point
    in the set to the label of its component,
  • induces the Equivalence Relation which
    equates two points iff they are in the same
    component,
  • induces the Undirected Graph with an edge
    connecting each S-pair from the same component.
  • A Function, f, from S to R
  • induces the Partition into the pre-image
    components of its R-points, {f⁻¹(r) | r∈R},
  • induces the Equivalence Relation which
    equates x~y iff f(x)=f(y)=r (labeling that
    component as r),
  • induces the Undirected Graph with an edge
    connecting x with y iff f(x)=f(y).
  • An Equivalence Relation on S, x~y,
  • induces the Partition into its equivalence
    classes, x_comp = {y∈S | y~x} (the x's are
    canonical component representatives),
  • induces the function f(y)=x iff y∈x_comp,
  • induces the Undirected Graph with an edge
    connecting x with y iff x~y.

2. PARTIALLY ORDERED SET ↔ CLOSED DIRECTED ACYCLIC
GRAPH: A Closed Directed Acyclic Graph on S induces
the Partially Ordered Set containing (x, y) iff
there is an edge from x to y in the graph. A
Partial Ordering, ≤, on S induces the Directed
Acyclic Graph with an edge running from x to y iff
x ≤ y.
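A small illustrative sketch of the first set of dualities on a finite set (the labels and points are made up):

    # A labeled partition of S = {1, 2, 3, 4, 5}.
    partition = {"a": {1, 2}, "b": {3, 4, 5}}

    # Induced function: each point goes to the label of its component.
    f = {x: label for label, comp in partition.items() for x in comp}

    # Induced equivalence relation: x ~ y iff f(x) == f(y).
    equiv = {(x, y) for x in f for y in f if f[x] == f[y]}

    # Induced undirected graph: an edge for each same-component pair.
    edges = {frozenset((x, y)) for (x, y) in equiv if x != y}

    # Going back: the pre-image components f^-1(r) recover the partition.
    recovered = {}
    for x, r in f.items():
        recovered.setdefault(r, set()).add(x)
    assert recovered == partition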
16
RoloDex Model
The dualities on the previous slide apply to unary
relationships (relationships in which there is one
entity and we are relating pairs of instances from
that entity). The relationships we had been talking
about were all Bipartite Relationships, in which we
have two separate entities (or two disjoint subsets
of the same entity), and we only relate x to y if x
and y are instances from different entities. Often,
we need to analyze an even more complex situation
in which we have a combination of bipartite and
unipartite relationships, called
  • Bipartite - Unipartite on Part (BUP)
    relationships.
  • Examples:
  • In Bioinformatics, bipartite relationships
    between genes and experiments (a gene is related
    to an experiment iff it expresses at a threshold
    level in that experiment) are studied in
    conjunction with unipartite relationships on gene
    pairs (e.g., gene-gene or protein-protein
    interactions).
  • In Market Research, bipartite relationships
    between items and customers are studied in
    conjunction with unipartite relationships on the
    customers (e.g., x~y iff x and y are males under
    25).
  • For these BUP situations, we suggest the RoloDex
    Model. In this model, each relationship is
    expressed as a "card" in a rolodex revolving
    around the entities involved in that relationship
    (a small data-structure sketch follows).
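A minimal data-structure sketch of this model (all names illustrative): entities are axes, each relationship is a card on the axes it involves, and a unipartite card simply uses the same axis twice.

    class RoloDex:
        def __init__(self):
            # cards[name] = (axis_a, axis_b, set of (a, b) instance pairs)
            self.cards = {}

        def add_card(self, name, axis_a, axis_b):
            self.cards[name] = (axis_a, axis_b, set())

        def relate(self, name, a, b):
            self.cards[name][2].add((a, b))

    dex = RoloDex()
    # Bipartite card: gene g expresses in experiment e at threshold level.
    dex.add_card("exp-gene", "Experiments", "Genes")
    dex.relate("exp-gene", "e1", "g7")
    # Unipartite ("conical") card on the Genes axis: gene-gene interactions.
    dex.add_card("gene-gene", "Genes", "Genes")
    dex.relate("gene-gene", "g7", "g9")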

17
The Bipartite, Unipartite-on-Part (BUP)
Experiment-Gene-Gene Relationship, EGG
So as not to duplicate axes, this copy of Genes
should ideally be folded over to coincide with
the other copy, producing a "conical" unipartite
card.
Each conical RoloDex card revolving about the
Gene Entity axis is a separate Gene-Gene (or
Protein-Protein) interaction relationship.
[Figure: a Genes x Genes grid (the conical gene-gene card, folded over the Genes axis) and a Genes x Experiments grid]
Each rectangular RoloDex card revolving about the
Gene Entity axis and about the Experiments axis is
a separate Gene-Experiment relationship.
18
∀ Axis-Card pair (Entity-Relationship pair),
a∈c(a,b), ∃ a support count (or ratio or %) for
Axis-Sets A: for a graph relationship,
supp_G∃(A) = |{b | ∃a∈A, (a,b)∈c}|,
and for a multigraph, supp_MG∃(A) is the histogram
over b of the (a,b)-edge counts, a∈A. Other
quantifiers can be used also (e.g., the universal,
∀, is used in MBR).
Conf(A⟹B) = Supp(A∪B) / Supp(A)
Supp(A) = CustFreq(ItemSet)
Most interestingness measures are based on one of
these supports. In IR, df(t) = supp_G∃(t),
t∈c(t,d); tf(t,d) is the one histogram bar in
supp_MG∃(t), t∈c(t,d). In MBR, supp(I) =
supp_G∀(I), i∈c(i,t). In MDA, supp_MG∃(GSet),
g∈c(g,e). Of course, all supports are inherited
redundantly by the card, c(a,b). A small sketch of
the two support counts follows.
[Figure: RoloDex Model with cards: cust-item, term-doc, author-doc, doc-doc, gene-gene (ppi), exp-PI, exp-gene, term-term (share stem?), revolving about entity axes (People, ...)]
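A small sketch of the two support counts on a card c(a, b) (the edge list and axis set are illustrative):

    from collections import Counter

    card = {("a1", "b1"), ("a1", "b2"), ("a2", "b2"), ("a3", "b3")}
    A = {"a1", "a2"}

    # Graph support (existential): b's related to at least one a in A.
    supp_G = {b for (a, b) in card if a in A}

    # Multigraph support: the histogram over b of (a, b)-edge counts, a in A.
    supp_MG = Counter(b for (a, b) in card if a in A)

    print(len(supp_G), supp_MG)   # 2  Counter({'b2': 2, 'b1': 1})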
19
Cousin Association Rules (CARs)
  • ∀ card (RELATIONSHIP), c(I,T), one has
  • I-Association Rules among disjoint I-sets: A⟹C,
    ∀A,C⊆I, with A∩C=∅, and
  • T-Association Rules among disjoint T-sets: A⟹C,
    ∀A,C⊆T, with A∩C=∅.
  • Two measures of quality of A⟹C are:
    SUPP(A∪C), where, e.g., for any I-set, A,
    SUPP(A) = |{t | (i,t)∈E ∀i∈A}|, and
    CONF(A⟹C) = SUPP(A∪C)/SUPP(A).
  • First Cousin Association Rules:
  • Given any card, c(T,U), sharing axis T with the
    bipartite relationship, b(T,I),
  • Cousin Association Rules are those in which the
    antecedent T-set is generated by a subset, S,
    of U as follows: {t∈T | ∃u∈S such that (t,u)∈c}
    (note: this should be called an "existential first
    cousin AR" since we are using the existential
    quantifier. One can use the universal quantifier
    instead, as was used in MBR ARM).
  • E.g., if S⊆U, A=c(S), A'⊆T, then A⟹A' is a CAR,
    and we can also label it S⟹A' (a small sketch
    follows this list).
  • First Cousin Association Rules Once Removed
    (FCAR-1r) are those in which both T-sets are
    generated by another bipartite relationship, and
    we can label the antecedent and/or the consequent
    using the generating set or the T-set.
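A small sketch of the existential antecedent generation (all sets illustrative): S ⊆ U generates the T-set A through the sharing card c(T,U), and A's support is then measured on the bipartite card b(T,I).

    c_TU = {("t1", "u1"), ("t2", "u1"), ("t3", "u2")}   # sharing card c(T,U)
    b_TI = {("t1", "i1"), ("t1", "i2"), ("t2", "i2")}   # bipartite card b(T,I)

    S = {"u1"}
    # Existential generation: A = {t in T | exists u in S with (t,u) in c}.
    A = {t for (t, u) in c_TU if u in S}                # {'t1', 't2'}

    # SUPP(A) on b(T,I): the i's related to every t in A.
    I = {i for (_, i) in b_TI}
    supp_A = {i for i in I if all((t, i) in b_TI for t in A)}
    print(A, supp_A)                                    # -> {'i2'}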

20
The Cousin Association Rules
  • Second Cousin Association Rules are those in
    which the antecedent T-set is generated by a
    subset of an axis which shares a card with T,
    which shares the card, B, with I. 2CARs can be
    denoted using the generating (second cousin) set
    or the T-set antecedent.

Second Cousin Association Rules once removed are
those in which the antecedent T-set is generated
by a subset of an axis which shares a card with
T, which shares the card, B, with I, and the
consequent is generated by C(T,U) (a first
cousin T-set). 2CAR-1rs can be denoted using
any combination of the generating (second cousin)
set or the T-set antecedent and the generating
(first cousin) set or T-set consequent.
Second Cousin Association Rules twice removed are
those in which the antecedent T-set is generated
by a subset of an axis which shares a card with
T, which shares the card, B, with I, and the
consequent is generated by a subset of an axis
which shares a card with T, which shares another
first cousin card with I. 2CAR-2rs can be
denoted using any combination of the generating
(second cousin) set or the T-set antecedent and
the generating (second cousin) set or T-set
consequent. Note: 2CAR-2rs are also 2CAR-1rs, so
they can be denoted as above also.
Third Cousin Association Rules are those.... We
note that these definitions give us many
opportunities to define quality measures.
21
Measuring Quality in the RoloDex Model
[Figure: the RoloDex of cards again: cust-item, term-doc, author-doc, gene-gene (ppi), doc-doc, exp-PI, exp-gene, term-term (share stem?), about entity axes (People, ...)]
For Distance CARMA relationships, quality (e.g.,
supp or conf or ???) can be measured using
information on any/all cards along the
relationship (multiple cards can contribute
factors or terms, or contribute in some other
way???).
22
Generalized CARs
First, we propose a definition of Generalized
Association Rules (GARs) which contains the
standard "1 Entity Itemset" AR definition as a
special case. Association Pathway Mining (APM) is
a DM technique (with application to
bioinformatics?). Given Relationships, R1, R2
(RoloDex cards) with shared Entity, F, (axis),
E --R1-- F --R2-- G, and given A⊆E and C⊆G, then
A⟹C is a Generalized F Association Rule, with
Support_R1,R2(A⟹C) = |{t∈F | ∀a∈A, (a,t)∈R1 and ∀c∈C, (c,t)∈R2}|
Confidence_R1,R2(A⟹C) = Support_R1,R2(A⟹C) / Support_R1(A),
where, as always,
Support_R1(A) = |{t∈F | ∀a∈A, (a,t)∈R1}|.
E.g., the GAR is a standard AR iff E = G (the
one-entity itemset case). Association Pathway
Mining (APM) is the identification and assessment
(e.g., support, confidence, etc.) of chains of
GARs in a RoloDex. Restricting to the mining of
cousin GARs reduces the number of strong rules or
pathway links. A small sketch of the support and
confidence computation follows.
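A small sketch of the generalized support and confidence over two cards R1(E,F) and R2(F,G) sharing axis F (the edge sets are illustrative):

    R1 = {("e1", "f1"), ("e1", "f2"), ("e2", "f1")}   # R1 on E x F
    R2 = {("g1", "f1"), ("g1", "f2"), ("g2", "f2")}   # R2 on G x F
    F = {"f1", "f2", "f3"}

    def support_R1(A):
        # {t in F | for every a in A, (a, t) in R1}
        return {t for t in F if all((a, t) in R1 for a in A)}

    def support_R1R2(A, C):
        # {t in F | all of A relates to t in R1 and all of C in R2}
        return {t for t in F if all((a, t) in R1 for a in A)
                             and all((c, t) in R2 for c in C)}

    A, C = {"e1"}, {"g1"}
    supp = len(support_R1R2(A, C))
    conf = supp / len(support_R1(A))
    print(supp, conf)   # 2 supporting F-points (f1, f2); confidence 1.0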
23
More generally, for entities E, F, G and
relationships R1(E,F) and R2(F,G),
A ⊆ E --R1-- F --R2-- G ⊇ C:
Support-Set_R1,R2(A,C) = SS_R1,R2(A,C) = {t∈F | ∀a∈A, (a,t)∈R1 and ∀c∈C, (c,t)∈R2}.
If F has real labels,
Label-Weighted-Support_R1,R2(A,C) = LWS_R1,R2(A,C) = Σ_{t∈SS_R1,R2(A,C)} label(t)
(the un-generalized case occurs when all weights
are 1).
Downward closure property of Support Sets:
SS(A',C') ⊇ SS(A,C) ∀A'⊆A, C'⊆C. Therefore, if all
labels are non-negative, then LWS(A,C) ≤ LWS(A',C')
(a necessary condition for LWS(A,C) to exceed a
threshold is that all LWS(A',C') exceed that
threshold, ∀A'⊆A, C'⊆C). So an Apriori-like
frequent-set-pair miner would go as follows: start
with pairs of 1-sets (in E and G); the only
candidate 2-antecedents with 1-consequents
(equivalently, 2-consequents with 1-antecedents)
would be those formed by joining ... The weighted
support concept can be extended to the case where
R1 and/or R2 have labels as well. Vertical methods
can be applied by converting F to vertical format
(F instances are the rows, and pertinent features
from other cards/axes are "rolled over" to F as
derived feature attributes). A small sketch
follows.
[Figure: entities E, F, G with relationships R1, R3, sets A ⊆ E and C ⊆ G, and edge labels l2,2, l2,3]
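A small sketch of label-weighted support and the downward-closure check (labels and edges are illustrative):

    R1 = {("e1", "f1"), ("e1", "f2"), ("e2", "f2")}
    R2 = {("g1", "f1"), ("g1", "f2")}
    label = {"f1": 0.5, "f2": 2.0, "f3": 1.0}   # non-negative F labels

    def SS(A, C):
        # Support set: F-points related to all of A (R1) and all of C (R2).
        return {t for t in label if all((a, t) in R1 for a in A)
                                 and all((c, t) in R2 for c in C)}

    def LWS(A, C):
        return sum(label[t] for t in SS(A, C))

    # Shrinking A or C can only enlarge the support set, so with
    # non-negative labels LWS can only grow: the Apriori-style pruning test.
    assert SS({"e1", "e2"}, {"g1"}) <= SS({"e1"}, {"g1"})
    assert LWS({"e1", "e2"}, {"g1"}) <= LWS({"e1"}, {"g1"})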
24
VERTIGO: A Vertically Structured Rendition of
the GO (Gene Ontology)? How do we include GO
data in the Data Mining processes?
1. Treat it as a horizontally structured dataset.
2. View GO as a Gene-Set hierarchy (that seems to
be how it is used, often?) with the other aspects
of it as node labels. One could then minimize it
as a subset of the Set Enumeration Tree with
highly structured labels?
3. Preprocess pertinent GO information into
derived attributes on a Gene Table.
4. Use the RoloDex Model for it? Preliminary
thoughts on this alternative include: each of the
three major annotation areas (Molecular Function,
Cellular Location, Biological Process) is a
Gene-to-Annotation Card. A tiny sketch of
alternatives 3 and 4 follows.
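A tiny sketch contrasting alternatives 3 and 4 (the GO term IDs and genes are made up): option 4 keeps one Gene-to-Annotation card per annotation area; option 3 rolls the same information over to derived attributes on a Gene Table.

    # Option 4: one card per major annotation area.
    go_cards = {
        "Molecular Function": {("g1", "GO:0003674"), ("g2", "GO:0005488")},
        "Cellular Location":  {("g1", "GO:0005575")},
        "Biological Process": {("g2", "GO:0008150")},
    }

    # Option 3: derived boolean attributes on a Gene Table,
    # one column per pertinent GO term.
    genes = {"g1", "g2"}
    terms = {t for card in go_cards.values() for (_, t) in card}
    gene_table = {g: {t: any((g, t) in card for card in go_cards.values())
                      for t in sorted(terms)}
                  for g in genes}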

25
Thank you.