1
DB Seminar Series
HARP: A Hierarchical Algorithm with Automatic Relevant Attribute Selection for Projected Clustering
  • Presented by
  • Kevin Yip
  • 20 September 2002

2
Short Summary
  • Our own work (unpublished), supervised by Dr. Cheung and Dr. Ng
  • Problem: to cluster datasets of very high dimensionality
  • Assumption: clusters are formed in subspaces

3
Short Summary
  • Previous approaches either have special restrictions on the dataset or target clusters, or cannot determine the dimensionality of the clusters automatically
  • Our approach is not restricted by these limitations

4
Presentation Outline
  • Clustering
  • Projected clustering
  • Previous approaches to projected clustering
  • Our approach: HARP
  • Concepts
  • Implementation: HARP.1
  • Experiments
  • Future work and conclusions

5
Clustering
  • Goal: given a dataset D with N records and d attributes (dimensions), partition the records into k disjoint clusters such that
  • Intra-cluster similarity is maximized
  • Inter-cluster similarity is minimized

6
Clustering
  • How to measure similarity? (a small sketch follows this list)
  • Distance-based: Manhattan distance, Euclidean distance, etc.
  • Correlation-based: cosine correlation, Pearson correlation, etc.
  • Link-based: common neighbors
  • Pattern-based
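A minimal sketch of the distance- and correlation-based measures named above, using NumPy (the example vectors are arbitrary):

```python
import numpy as np

def manhattan(x, y):
    # L1 distance: sum of absolute coordinate differences.
    return np.sum(np.abs(x - y))

def euclidean(x, y):
    # L2 distance: square root of the summed squared differences.
    return np.sqrt(np.sum((x - y) ** 2))

def cosine_correlation(x, y):
    # Cosine of the angle between the two vectors; 1 means identical direction.
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def pearson_correlation(x, y):
    # Cosine correlation of the mean-centred vectors.
    return cosine_correlation(x - x.mean(), y - y.mean())

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 7.0])
print(manhattan(x, y), euclidean(x, y), cosine_correlation(x, y), pearson_correlation(x, y))
```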

7
Clustering
  • Two common types of clustering algorithms:
  • Partitional: selects some representative points for each cluster, assigns all other points to their closest clusters, and then re-determines the new representative points
  • Hierarchical (agglomerative): repeatedly determines the two most similar clusters and merges them (a minimal sketch follows)
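A minimal runnable sketch of the agglomerative scheme, assuming centroid distance as the similarity measure (the data and the stopping criterion k are illustrative):

```python
import numpy as np

def agglomerative(points, k):
    # Start with every record in its own cluster.
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None
        # Repeatedly determine the two most similar clusters
        # (here: the pair with the closest centroids) ...
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = np.linalg.norm(points[clusters[a]].mean(axis=0) -
                                   points[clusters[b]].mean(axis=0))
                if best is None or d < best[0]:
                    best = (d, a, b)
        # ... and merge them.
        _, a, b = best
        clusters[a].extend(clusters[b])
        del clusters[b]
    return clusters

data = np.random.default_rng(0).random((20, 2))
print(agglomerative(data, k=3))
```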

8
Clustering
  • Partitional clustering

9
Clustering
  • Hierarchical clustering

10
Projected Clustering
  • Assumption (general case): each cluster is formed in a subspace
  • Assumption (special case): each cluster has a set of relevant attributes
  • Goal: determine the records and the relevant attributes of each cluster (to select the relevant attributes, we must first define what relevance means)

11
Projected Clustering
  • A 3-D view

Source of figure: DOC (SIGMOD 2002)
12
Projected Clustering
  • An example dataset

13
Projected Clustering
  • Projected clustering vs. feature selection
  • Feature selection selects one feature set for all records, but projected clustering selects an attribute set individually for each cluster
  • Feature selection is a preprocessing task, but projected clustering selects attributes during the clustering process

14
Projected Clustering
  • Why is projected clustering important? (a small numeric illustration follows this list)
  • At high dimensionality the data points are sparse, and the distance between any two points is almost the same
  • There are many noise attributes that we are not interested in
  • High dimensionality implies high computational complexity
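A small numeric illustration of the first point (the sizes and dimensionalities are arbitrary): as the dimensionality of uniformly random points grows, the gap between the nearest and the farthest pairwise distance shrinks relative to the mean distance.

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
for d in (2, 20, 200, 2000):
    points = rng.random((100, d))      # 100 random points in the d-dimensional unit cube
    dists = pdist(points)              # all pairwise Euclidean distances
    # The relative spread of the distances shrinks as d grows:
    # every point becomes almost equally far from every other point.
    print(f"d={d:5d}  (max - min) / mean = {(dists.max() - dists.min()) / dists.mean():.3f}")
```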

15
Previous Approaches
  • (Refer to my previous DB seminar on 17 May 2002
    titled The Subspace Clustering Problem)
  • Grid-based dimension selection (CLIQUE, ENCLUS,
    MAFIA)
  • Association rule hypergraph partitioning
  • Context-specific Bayesian clustering
  • Monte Carlo algorithm (DOC)
  • Projective clustering (PROCLUS, ORCLUS)

16
Previous Approaches
  • PROCLUS:
    1. Draw medoids
    2. Determine neighbors
    3. Select attributes
    4. Assign records
    5. Replace medoids
    6. Go to step 2
  • ORCLUS:
    1. Draw medoids
    2. Assign records
    3. Select vectors
    4. Merge (reselect vectors and determine centroid)
    5. Go to step 2
  • (A heavily simplified PROCLUS-style loop is sketched below)
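The sketch below is a simplified, runnable PROCLUS-style loop for illustration only; the locality size, the attribute-selection rule, and the medoid-replacement rule are placeholders, not the published algorithm.

```python
import numpy as np

def proclus_like(data, k, avg_dims, n_iter=10, seed=0):
    rng = np.random.default_rng(seed)
    n, d = data.shape
    medoid_idx = rng.choice(n, k, replace=False)                             # 1. draw medoids
    for _ in range(n_iter):
        medoids = data[medoid_idx]
        dists = np.linalg.norm(data[:, None, :] - medoids[None, :, :], axis=2)
        localities = [np.argsort(dists[:, j])[: n // k] for j in range(k)]   # 2. neighbours
        # 3. per medoid, select the avg_dims attributes with the smallest spread in its locality
        dims = [np.argsort(data[loc].std(axis=0))[:avg_dims] for loc in localities]
        # 4. assign each record to the medoid that is closest in that medoid's subspace
        subspace_dist = np.stack(
            [np.abs(data[:, dims[j]] - medoids[j, dims[j]]).mean(axis=1) for j in range(k)],
            axis=1)
        labels = subspace_dist.argmin(axis=1)
        # 5. replace each medoid by the record closest to its cluster centroid, then repeat
        new_idx = medoid_idx.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            if len(members) > 0:
                centroid = data[members].mean(axis=0)
                new_idx[j] = members[np.argmin(np.linalg.norm(data[members] - centroid, axis=1))]
        if np.array_equal(new_idx, medoid_idx):
            break
        medoid_idx = new_idx
    return labels, dims

data = np.random.default_rng(1).random((200, 30))
labels, dims = proclus_like(data, k=4, avg_dims=5)
print(np.bincount(labels), [d.tolist() for d in dims])
```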

17
Previous Approaches
  • Summary of the limitations of previous approaches (each approach has one or more of the following):
  • Produce non-disjoint clusters
  • Have exponential time complexity w.r.t. cluster dimensionality
  • Allow each attribute value to be selected by only one cluster
  • Unable to determine the dimensionality of each cluster automatically
  • Produce clusters all of the same dimensionality
  • Consider only local statistical values in attribute selection
  • Unable to handle datasets with mixed attribute types
  • Assign records to clusters regardless of their distances
  • Require datasets to have many more records than attributes

18
Our Approach HARP
  • Motivations
  • From the datasets: we want to study gene expression profile datasets (usually with thousands of genes and fewer than a hundred samples)
  • From previous algorithms: we want to develop a new algorithm that does not have any of the above limitations

19
Our Approach HARP
  • HARP: a Hierarchical algorithm with Automatic Relevant attribute selection for Projected clustering
  • Special features:
  • Automatic attribute selection
  • Customizable procedures
  • Mutual disagreement prevention

20
Our Approach HARP
  • Special implementation based on attribute value
    density, HARP.1
  • Use of global statistics in attribute selection
  • Generic similarity calculations that can handle
    both categorical and numeric attributes
  • Implementing all mutual disagreement mechanisms
    defined by HARP
  • Reduced time complexity by pre-clustering

21
Our Approach HARP
  • Basic idea
  • In the partitional approaches:
  • At the beginning, each record is assigned to a cluster by calculating distances/similarities using all attributes
  • It is very likely that some assignments are incorrect
  • There is no clue for finding the dimensionality of the clusters
  • Our approach:
  • Allow only the best merges at any time

22
Our Approach HARP
  • Basic idea
  • Best a merge is permitted only if
  • Each selected attribute of the resulting cluster
    has a relevance of at least dt
  • The resulting cluster has more than mapc selected
    attributes
  • The two participating clusters have a mutual
    disagreement not larger than md
  • Mapc, dt, md threshold variables
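A small sketch of the admission test implied by these thresholds; how the relevance scores and the mutual disagreement are computed depends on the chosen similarity measure (e.g. the density-based one in HARP.1), so the numbers below are made up:

```python
def merge_allowed(relevances_new, mutual_disagreement, dt, mapc, md):
    # relevances_new: relevance scores of the attributes that would be
    # selected for the resulting cluster.
    selected = [r for r in relevances_new if r >= dt]   # each selected attribute must reach dt
    if len(selected) <= mapc:                           # need more than mapc selected attributes
        return False
    if mutual_disagreement > md:                        # disagreement must not exceed md
        return False
    return True

print(merge_allowed([0.9, 0.7, 0.6, 0.2], mutual_disagreement=0.1, dt=0.5, mapc=2, md=0.3))  # True
print(merge_allowed([0.9, 0.2, 0.1, 0.2], mutual_disagreement=0.1, dt=0.5, mapc=2, md=0.3))  # False
```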

23
Our Approach HARP
  • Multi-step clustering

[Figure: multi-step clustering — the thresholds dt, mapc and md are adjusted step by step from their initial values toward their baseline values.]
24
Our Approach HARP
  • Expected resulting clusters
  • Have all relevant attributes selected (due to
    mapc)
  • Selected attributes have high relevance to the
    cluster (due to dt)
  • Not biased by the participating clusters (due to
    md and some other mechanisms)

25
Our Approach HARP
  • More details: attribute relevance
  • Depends on the definition of the similarity measure
  • E.g. the density-based measure defines the relevance of an attribute to a cluster by the compactness of its values in the cluster; compactness can be reflected by the variance (a tiny illustration follows)
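A tiny illustration of that idea on synthetic data: an attribute whose values are compact inside the cluster (low local variance relative to the global variance) is relevant, while one that keeps the global spread is not.

```python
import numpy as np

rng = np.random.default_rng(0)
dataset = rng.normal(0.0, 1.0, size=(1000, 2))    # two attributes, global variance roughly 1
cluster = dataset[:100].copy()
cluster[:, 0] = rng.normal(0.5, 0.1, size=100)    # attribute 0: compact within the cluster (relevant)
                                                  # attribute 1: keeps the global spread (irrelevant)
for a in range(2):
    print(f"attribute {a}: local variance {cluster[:, a].var():.3f}, "
          f"global variance {dataset[:, a].var():.3f}")
```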

26
Our Approach HARP
  • More details: attribute relevance
  • Which attributes are relevant to the clusters?
  • A1, A2: local statistics
  • A3, A4: global statistics

27
Our Approach HARP
  • More details: mutual disagreement
  • The two clusters participating in a merge do not agree with each other

28
Our Approach HARP
  • More details: mutual disagreement
  • Case 1: a cluster of 100 records (relevant attributes A1, A2) merges with a cluster of 5 records (relevant attributes A3, A4), giving a cluster of 105 records with A1, A2 selected
  • One cluster dominates the selection of attributes

29
Our Approach HARP
  • More details: mutual disagreement
  • Case 2: a cluster of 50 records (relevant attributes A1, A2) merges with a cluster of 50 records (relevant attributes A1, A2, ..., A6), giving a cluster of 100 records with only A1, A2 selected
  • The clusters lose some information due to the merge

30
Our Approach HARP
  • More details: mutual disagreement
  • Mutual disagreement prevention:
  • Set up the md threshold to limit the maximum disagreement on the new set of attributes
  • Get the statistics of the information loss over all possible merges, and discard merges with extraordinarily high loss
  • Add a punishment factor to the similarity score

31
Our Approach HARP.1
  • HARP.1: an implementation of HARP that defines the relevance of an attribute to a cluster by its density improvement over the global density
  • Relevance score of an attribute to a cluster (see the sketch below):
  • Categorical: 1 - (1 - ModeRatio_local) / (1 - ModeRatio_global)
  • Numeric: 1 - Var_local / Var_global
  • When ModeRatio_global = 1 or Var_global = 0, the score is 0
  • If C1 and C2 merge into Cnew, we can use the records of C1 and C2 to evaluate their agreement on the selected attributes of Cnew in a similar way.
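A direct transcription of the two scores above (the function names and toy values are illustrative):

```python
import numpy as np

def mode_ratio(values):
    # Fraction of records that take the most frequent (mode) value.
    _, counts = np.unique(values, return_counts=True)
    return counts.max() / len(values)

def relevance_categorical(values_local, values_global):
    # 1 - (1 - ModeRatio_local) / (1 - ModeRatio_global); 0 when ModeRatio_global = 1.
    mr_global = mode_ratio(values_global)
    if mr_global == 1:
        return 0.0
    return 1.0 - (1.0 - mode_ratio(values_local)) / (1.0 - mr_global)

def relevance_numeric(values_local, values_global):
    # 1 - Var_local / Var_global; 0 when Var_global = 0.
    var_global = np.var(values_global)
    if var_global == 0:
        return 0.0
    return 1.0 - np.var(values_local) / var_global

print(relevance_categorical(["a", "a", "a", "b"], ["a", "b", "c", "d", "a", "b"]))
print(relevance_numeric([1.0, 1.1, 0.9], [0.0, 5.0, 2.5, 1.0, 4.0]))
```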

32
Our Approach HARP.1
  • Mutual disagreement calculations
  • Den(Ci, a): how good attribute a is in Ci
  • Den(Ci, Cnew, a): how good attribute a is in Ci, evaluated using the properties of a in Cnew
  • Both values are in the range [0, 1]

33
Our Approach HARP.1
  • Similarity score

34
Our Approach HARP.1
  • Multi-step clustering, as in HARP, with the thresholds loosened step by step from their initial values toward their baseline values
  • Baseline value for each dt variable: the global statistical value
  • Initial and baseline values for the md variable: user parameters (defaults 10 and 50)
  • Each cluster keeps a local score list (binary tree) containing merges with all other clusters; the best scores are propagated to a global score list (a sketch of this merge loop follows)
  • With mutual disagreement prevention:
  • MD(C1, C2) < md
  • The sum of and difference between ILoss(C1, Cnew) and ILoss(C2, Cnew) must not be more than a certain number of standard deviations from the mean
  • Punishment factor in the similarity score
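A minimal sketch of the score-list machinery described above: every candidate merge gets a score, the best candidates live in a global heap, and stale entries are skipped via a per-cluster version counter. The threshold tests (dt / mapc / md and the disagreement checks) are abstracted into an `allowed` callback, and the score here is plain centroid distance rather than HARP.1's similarity score; the loosening of thresholds across steps is also omitted.

```python
import heapq
import numpy as np

def merge_loop(points, k, allowed=lambda a, b: True):
    clusters = {i: [i] for i in range(len(points))}
    version = {i: 0 for i in clusters}            # bumped whenever a cluster changes

    def score(a, b):
        # Smaller = better merge (centroid distance stands in for HARP.1's score).
        return np.linalg.norm(points[clusters[a]].mean(axis=0) -
                              points[clusters[b]].mean(axis=0))

    heap = [(score(a, b), a, b, 0, 0) for a in clusters for b in clusters if a < b]
    heapq.heapify(heap)
    while len(clusters) > k and heap:
        s, a, b, va, vb = heapq.heappop(heap)
        if a not in clusters or b not in clusters:
            continue                              # one participant was already merged away
        if version[a] != va or version[b] != vb:
            continue                              # stale score: a participant changed meanwhile
        if not allowed(clusters[a], clusters[b]):
            continue                              # merge rejected by the threshold tests
        clusters[a].extend(clusters[b])           # perform the merge ...
        del clusters[b]
        version[a] += 1
        for c in clusters:                        # ... and push fresh scores for the new cluster
            if c != a:
                lo, hi = min(a, c), max(a, c)
                heapq.heappush(heap, (score(lo, hi), lo, hi, version[lo], version[hi]))
    return list(clusters.values())

data = np.random.default_rng(2).random((30, 5))
print(merge_loop(data, k=4))
```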
35
Our Approach HARP.1
  • Time complexity
  • Speeding up: use a fast projected clustering algorithm to pre-cluster the data
  • Space complexity

36
Our Approach HARP.1
  • Accuracy experiments (datasets)

37
Our Approach HARP.1
  • Accuracy experiments (results 1)

38
Our Approach HARP.1
  • Accuracy experiments (results 2)

39
Our Approach HARP.1
  • Accuracy experiments (results 3)
  • Dataset: 500 records, 200 attributes, on average 13 relevant attributes, 5 classes
  • Pre-clustering: form 50 clusters

40
Our Approach HARP.1
  • Scalability experiments (scaling N)

41
Our Approach HARP.1
  • Scalability experiments (scaling d)

42
Our Approach HARP.1
  • Scalability experiments (scaling average number
    of relevant attributes)

43
Our Approach HARP.1
  • Scalability experiments (scaling N with
    pre-clustering)

44
Our Approach HARP.1
  • Application: gene expression datasets
  • Lymphoma dataset: Nature 403 (2000)
  • 96 samples, 4026 genes, 9 classes

45
Our Approach HARP.1
  • Application: gene expression datasets
  • Can also use genes as records and samples as attributes
  • E.g. use the dendrogram to produce an ordering of all genes (a small sketch follows)
  • Based on some domain knowledge, validate the ordering
  • If the ordering is valid, the positions of other genes of unknown function can be analyzed
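A small sketch of dendrogram-based gene ordering using SciPy's standard hierarchical clustering (average linkage on Euclidean distance); HARP's own dendrogram and similarity measure would be used in its place, and the expression matrix here is random.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list

rng = np.random.default_rng(0)
expression = rng.normal(size=(50, 8))   # rows = genes (records), columns = samples (attributes)

# Build a dendrogram over the genes and read off its leaf order,
# which gives one ordering of all genes that can then be validated
# against domain knowledge.
Z = linkage(expression, method="average")
gene_order = leaves_list(Z)
print(gene_order)
```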

46
Future Work
  • Produce more implementations based on other
    similarity measures
  • Study the definition of relevance in gene
    expression datasets
  • Consider very large datasets that cannot fit into
    main memory
  • Extend the approach to solve other problems, e.g.
    k-NN in high dimensional space

47
Conclusions
  • A hierarchical projected clustering algorithm,
    HARP, is developed with
  • Dynamic selection of relevant attributes
  • Mutual disagreement prevention
  • Generic similarity calculation
  • A density-based implementation called HARP.1 is
    developed with
  • Good accuracy
  • Reasonable time complexity
  • Real applications on gene expression datasets

48
References
  • C. C. Aggarwal, C. Procopiuc, J. L. Wolf, P. S.
    Yu, and J. S. Park. Fast algorithms for projected
    clustering. In ACM SIGMOD International
    Conference on Management of Data, 1999.
  • C. C. Aggarwal and P. S. Yu. Finding generalized
    projected clusters in high dimensional spaces.
    pages 70-81, 2000.
  • R. Agrawal, J. Gehrke, D. Gunopulos, and P.
    Raghavan. Automatic subspace clustering of high
    dimensional data for data mining applications. In
    ACM SIGMOD International Conference on Management
    of Data, 1998.
  • A. A. Alizadeh, M. B. Eisen, R. E. Davis, C. Ma,
    I. S. Lossos, A. Rosenwald, J. C. Boldrick, H.
    Sabet, T. Tran, X. Yu, J. I. Powell, L. Yang, G.
    E. Marti, T. Moore, J. Hudson, L. Lu, D. B.
    Lewis, R. Tibshirani, G. Sherlock, W. C. Chan, T.
    C. Greiner, D. D. Weisenburger, J. O. Armitage, R.
    Warnke, R. Levy, W. Wilson, M. R. Grever, J. C.
    Byrd, D. Botstein, P. O. Brown, and L. M. Staudt.
    Distinct types of diffuse large B-cell lymphoma
    identified by gene expression profiling. Nature,
    403(6769):503-511, 2000.
  • Y. Barash and N. Friedman. Context-specific
    Bayesian clustering for gene expression data. In
    Annual Conference on Research in Computational
    Molecular Biology, 2001.

49
References
  • C. H. Cheng, A. W.-C. Fu, and Y. Zhang.
    Entropy-based subspace clustering for mining
    numerical data. In Knowledge Discovery and Data
    Mining, pages 84-93, 1999.
  • S. Guha, R. Rastogi, and K. Shim. ROCK: A robust
    clustering algorithm for categorical attributes.
    In 15th International Conference on Data
    Engineering, 1999.
  • E.-H. Han, G. Karypis, V. Kumar, and B. Mobasher.
    Clustering based on association rule hypergraphs.
    In 1997 SIGMOD Workshop on Research Issues on
    Data Mining and Knowledge Discovery, 1997.
  • G. Karypis, R. Aggarwal, V. Kumar, and S.
    Shekhar. Multilevel hypergraph partitioning:
    Applications in VLSI domain. In ACM/IEEE Design
    Automation Conference, 1997.
  • H. Nagesh, S. Goil, and A. Choudhary. MAFIA:
    Efficient and scalable subspace clustering for
    very large data sets, 1999.
  • C. M. Procopiuc, M. Jones, P. K. Agarwal, and T.
    M. Murali. A monte carlo algorithm for fast
    projective clustering. In ACM SIGMOD
    International Conference on Management of Data,
    2002.
  • H. Wang, W. Wang, J. Yang, and P. S. Yu.
    Clustering by pattern similarity in large data
    sets. In ACM SIGMOD International Conference on
    Management of Data, 2002.