Clustering Categorical Data: The Case of Quran Verses
1
Clustering Categorical Data: The Case of Quran
Verses
  • Presented By
  • Muhammad Al-Watban
  • IS 598

2
Outline
  • Introduction
  • Preprocessing of Quran Verses
  • Similarity Measures
  • Assessing Cluster Similarities
  • Shortcomings of Traditional Clustering Methods
    with Categorical Data
  • ROCK - Major definitions
  • ROCK Clustering Algorithm
  • ROCK Example
  • Conclusion and future work

3
Introduction
  • The Holy Quran covers a wide range of topics.
  • The Quran does not cover each topic with a set of
    sequenced verses or suras.
  • A single verse usually deals with many subjects.
  • Project goal: to cluster the verses of The Holy
    Quran based on the verses' subjects.

4
Preprocessing of Quran Verses
  • It is necessary to perform manual preprocessing
    of the Quran text to capture the subjects of the
    verses in a tabular format.
  • Verses in the Holy Quran can be viewed as records,
    with the related subjects as the attributes of
    each record: one row per verse, one column per
    subject.
  • Data in this form is similar to what is known as
    market-basket data.
  • Here, we will call it verses-treasures data.

5
Similarity Measures
  • Two types of attributes:
  • Continuous attributes
  • the range of attribute values is continuous and
    ordered
  • includes attributes with numeric values (e.g.
    salary)
  • also includes attributes whose allowed set of
    values forms an ordered, meaningful sequence
    (e.g. professional ranks, disease severity levels)
  • The similarity (or dissimilarity) between objects
    is computed based on the distance between them.
  • The most commonly used distance measures are
    Euclidean distance and Manhattan distance.

6
Similarity Measures
  • Categorical attributes
  • consist of attributes whose underlying domain is
    not ordered
  • Examples: colors, blood type.
  • If the attribute has only two states (namely 0
    and 1), it is called binary; if it has more
    than two states, it is called nominal.
  • There is no easy way to measure a distance
    between such objects.
  • We can define dissimilarity based on the simple
    matching approach (a small sketch follows):
  • d(i, j) = (p - m) / p
  • where m is the number of matched attributes, and p
    is the total number of attributes.
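A minimal sketch of the simple-matching dissimilarity above; the records and attribute values are illustrative assumptions, not data from the project.

```java
// Simple-matching dissimilarity: d(i, j) = (p - m) / p, where m counts
// attribute positions with equal values and p is the total number of
// attributes. The records below are hypothetical.
public class SimpleMatching {
    static double dissimilarity(String[] a, String[] b) {
        int p = a.length;
        int m = 0;
        for (int i = 0; i < p; i++) {
            if (a[i].equals(b[i])) m++;   // matched attribute
        }
        return (double) (p - m) / p;
    }

    public static void main(String[] args) {
        String[] x = {"red", "A", "yes"};
        String[] y = {"red", "B", "yes"};
        System.out.println(dissimilarity(x, y)); // 1 of 3 differ -> 0.333...
    }
}
```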

7
Similarity Measures
  • Where does the verses-treasures data fit?
  • Each verse can be represented by a record with
    Boolean attributes, where each attribute
    corresponds to a single subject (a small sketch
    follows).
  • The attribute corresponding to a subject is T if
    the verse covers that subject; otherwise, it
    is F.
  • As we said, Boolean attributes are a special case
    of categorical attributes.
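A minimal sketch of this representation; the subject vocabulary and helper names are illustrative assumptions.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;

// Represent a verse as a record of Boolean attributes over a fixed
// subject vocabulary (both are hypothetical examples).
public class VerseRecord {
    static boolean[] toRecord(Set<String> verseSubjects, List<String> allSubjects) {
        boolean[] record = new boolean[allSubjects.size()];
        for (int i = 0; i < record.length; i++) {
            // T if the verse covers this subject, F otherwise
            record[i] = verseSubjects.contains(allSubjects.get(i));
        }
        return record;
    }

    public static void main(String[] args) {
        List<String> subjects = List.of("judgment", "faith", "prayer", "fasting");
        boolean[] r = toRecord(Set.of("faith", "prayer"), subjects);
        System.out.println(Arrays.toString(r)); // [false, true, true, false]
    }
}
```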

8
Assessing Cluster Similarities
  • Many clustering algorithms (such as hierarchical
    clustering) require computing the distance between
    clusters (rather than between elements).
  • There are several standard methods:
  • 1. Single linkage
  • D(r,s), the distance between clusters r and s, is
    defined as the distance between the closest pair
    of objects, one from each cluster.

9
Assessing Cluster Similarities
  • 2. Complete linkage
  • the distance is defined as the distance between
    the farthest pair of objects, one from each
    cluster
  • 3. Average linkage
  • the distance is defined as the average of the
    distances between all pairs of objects (x, y),
    where x and y belong to different clusters

10
Assessing Cluster Similarities
  • 4. Centroid linkage
  • the distance between clusters is defined as the
    distance between the pair of cluster centroids

11
Shortcomings of Traditional Clustering Methods
with Categorical Data
  • Example
  • Consider the following 4 market-basket
    transactions:
  • T1 = {1, 2, 3, 4}
  • T2 = {1, 2, 4}
  • T3 = {3}
  • T4 = {4}
  • Converting these transactions to Boolean points,
    we get:
  • P1 = (1, 1, 1, 1)
  • P2 = (1, 1, 0, 1)
  • P3 = (0, 0, 1, 0)
  • P4 = (0, 0, 0, 1)
  • Using Euclidean distance to measure the closeness
    between all pairs of points, we find that
    d(P1,P2) is the smallest distance.

12
Shortcomings of Traditional Clustering Methods
with Categorical Data
  • If we use the centroid-based hierarchical
    algorithm, then we merge P1 and P2 and get a new
    cluster (P12) with (1, 1, 0.5, 1) as its centroid.
  • Then, using Euclidean distance again, we find
    (see the sketch after this list):
  • d(P12,P3) = √3.25 ≈ 1.80
  • d(P12,P4) = √2.25 = 1.50
  • d(P3,P4) = √2 ≈ 1.41
  • So, we should merge P3 and P4 since the distance
    between them is the shortest.
  • However, T3 and T4 do not have even a single
    common item.
  • So, using distance metrics as the similarity
    measure for categorical data is not appropriate.
  • The solution is ROCK.
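A minimal sketch reproducing the arithmetic above; the point values come from the transactions, everything else is illustrative.

```java
// After merging P1 and P2 into the centroid P12 = (1, 1, 0.5, 1),
// Euclidean distance favors merging P3 and P4 next, even though the
// underlying transactions T3 and T4 share no item.
public class EuclideanPitfall {
    static double euclidean(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        double[] p12 = {1, 1, 0.5, 1}; // centroid of P1 and P2
        double[] p3  = {0, 0, 1, 0};
        double[] p4  = {0, 0, 0, 1};
        System.out.println(euclidean(p12, p3)); // sqrt(3.25) ~ 1.80
        System.out.println(euclidean(p12, p4)); // sqrt(2.25) = 1.50
        System.out.println(euclidean(p3, p4));  // sqrt(2)    ~ 1.41
    }
}
```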

13
ROCK - Major definitions
  • Similarity function
  • Neighbors
  • Links
  • Criterion function
  • Goodness measure

14
Similarity function
  • Let sim(Pi, Pj) be a similarity function that is
    used to measure the closeness between points Pi
    and Pj.
  • ROCK assumes that the sim function is normalized
    to return a value between 0 and 1.
  • For the verses-treasures data, a possible
    definition of the sim function is based on the
    Jaccard coefficient:
  • sim(Pi, Pj) = |Pi ∩ Pj| / |Pi ∪ Pj|

15
Example: similarity function
  • Suppose two verses (P1 and P2) contain the
    following subjects:
  • P1 = {judgment, faith, prayer, fair}
  • P2 = {fasting, faith, prayer}
  • sim(P1,P2) = |P1 ∩ P2| / |P1 ∪ P2|
  • = 2 / 5 = 0.40
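A minimal sketch of this Jaccard similarity over subject sets; the class and method names are illustrative.

```java
import java.util.HashSet;
import java.util.Set;

// Jaccard similarity: |A ∩ B| / |A ∪ B| for two subject sets.
public class JaccardSim {
    static double sim(Set<String> a, Set<String> b) {
        Set<String> inter = new HashSet<>(a);
        inter.retainAll(b);                  // A ∩ B
        Set<String> union = new HashSet<>(a);
        union.addAll(b);                     // A ∪ B
        return (double) inter.size() / union.size();
    }

    public static void main(String[] args) {
        Set<String> p1 = Set.of("judgment", "faith", "prayer", "fair");
        Set<String> p2 = Set.of("fasting", "faith", "prayer");
        System.out.println(sim(p1, p2)); // |{faith, prayer}| / 5 = 0.4
    }
}
```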

16
Major definitions
  • Similarity for data objects
  • Neighbors
  • Links
  • Criterion function
  • Goodness measure

17
Neighbors and Links
  • One main problem of traditional clustering is
    that only local properties involving the two
    points themselves are considered.
  • Neighbor
  • If the similarity between two points is at least
    a certain similarity threshold (θ), they are
    neighbors.
  • Link
  • The link for a pair of points is the number of
    their common neighbors.
  • Obviously, the link incorporates global
    information about the other points in the
    neighborhood of the two points. The larger the
    link, the higher the probability that the pair
    of points is in the same cluster.

18
Example: neighboring and linking
  • Example
  • Assume that we have three distinct points p1, p2
    and p3, where
  • neighbor(p1) = {p1, p2}
  • neighbor(p2) = {p1, p2, p3}
  • neighbor(p3) = {p2, p3}
  • Neighboring graph: p1 - p2 - p3 (p2 is adjacent
    to both p1 and p3)
  • To define the number of links between two points,
    say p1 and p3, we have to find the number of
    their common neighbors; hence, we can define the
    linkage function between p1 and p3 to be
  • link(p1,p3) = |neighbor(p1) ∩ neighbor(p3)| =
    |{p2}|
  • or link(p1,p3) = 1 (a small sketch follows)
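A minimal sketch of this link computation from precomputed neighbor sets; the names are illustrative.

```java
import java.util.HashSet;
import java.util.Set;

// link(p, q) = number of common neighbors of p and q, given
// precomputed neighbor sets (each point is its own neighbor).
public class LinkCount {
    static int link(Set<String> neighborsP, Set<String> neighborsQ) {
        Set<String> common = new HashSet<>(neighborsP);
        common.retainAll(neighborsQ);   // intersection of neighbor sets
        return common.size();
    }

    public static void main(String[] args) {
        Set<String> nP1 = Set.of("p1", "p2");   // neighbor(p1)
        Set<String> nP3 = Set.of("p2", "p3");   // neighbor(p3)
        System.out.println(link(nP1, nP3));     // common neighbor {p2} -> 1
    }
}
```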

19
Example: minimum linkages
  • If we have four points P1, P2, P3, P4,
  • suppose that the similarity threshold (θ) is
    equal to 1.
  • Then, two points are neighbors if sim(Pi,Pj) ≥ 1;
  • hence, points are neighbors only to identical
    points (i.e. only to themselves).
  • To find link(P1,P2):
  • neighbor(P1) = {P1}
  • neighbor(P2) = {P2}
  • link(P1,P2) = |neighbor(P1) ∩ neighbor(P2)| = 0

20
  • With θ = 1, each point neighbors only itself, so
    the table of links (common neighbors) between the
    four points has 0 in every off-diagonal entry.
  • The neighboring graph consists of four isolated
    points.

21
Example: maximum linkages
  • If we have four points P1, P2, P3, P4,
  • suppose that the similarity threshold (θ) is
    equal to 0.
  • Then, two points are neighbors if sim(Pi,Pj) ≥ 0;
  • hence, any pair of points are neighbors.
  • To find link(P1,P2):
  • neighbor(P1) = {P1, P2, P3, P4}
  • neighbor(P2) = {P1, P2, P3, P4}
  • link(P1,P2) = |neighbor(P1) ∩ neighbor(P2)| = 4

22
  • With θ = 0, every point neighbors all four
    points, so the table of links (common neighbors)
    between the four points has 4 in every entry.
  • The neighboring graph is the complete graph on
    the four points.

23
Example: illustrating links
  • From the previous example, we have
  • neighbor(P1) = {P1, P2, P3, P4}
  • neighbor(P3) = {P1, P2, P3, P4}
  • link(P1,P3) = |neighbor(P1) ∩ neighbor(P3)| = 4
    links
  • We can depict these four different links (or
    paths) between P1 and P3, one through each of
    the four common neighbors.

24
Major definitions
  • Similarity for data objects
  • Neighbors
  • Links
  • Criterion function
  • Goodness measure

25
Criterion function
  • To get the best clusters, we have to maximize
    this criterion function:
  • E_l = Σ_{i=1..k} n_i × Σ_{pq,pr ∈ Ci} link(pq,pr) / n_i^(1+2f(θ))
  • where Ci denotes cluster i,
  • n_i is the number of points in Ci,
  • k is the number of clusters,
  • θ is the similarity threshold.
  • Suppose that in Ci each point has roughly n_i^f(θ)
    neighbors; dividing by n_i^(1+2f(θ)) normalizes
    by the expected number of intra-cluster links, so
    large clusters are not automatically favored.
  • A suitable choice for basket data is
    f(θ) = (1-θ)/(1+θ); for example, with θ = 0.3,
    f(θ) = 0.7/1.3 ≈ 0.54 (a small sketch follows).
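A minimal sketch evaluating this criterion function for a given clustering, assuming a precomputed point-to-point link matrix; the matrix used below is the one derived in the worked example later in the deck.

```java
import java.util.List;

// E_l = sum over clusters Ci of:
//   n_i * (sum of links over intra-cluster pairs) / n_i^(1 + 2f(θ))
public class Criterion {
    static double criterion(List<List<Integer>> clusters, int[][] link, double theta) {
        double f = (1 - theta) / (1 + theta);
        double total = 0;
        for (List<Integer> c : clusters) {
            int n = c.size();
            double intra = 0;
            for (int p : c)
                for (int q : c)
                    if (p < q) intra += link[p][q]; // each intra-cluster pair once
            total += n * intra / Math.pow(n, 1 + 2 * f);
        }
        return total;
    }

    public static void main(String[] args) {
        // Link counts for the four-verse example with θ = 0.3 (0-based indices)
        int[][] link = { {0, 3, 3, 1}, {3, 0, 3, 2}, {3, 3, 0, 1}, {1, 2, 1, 0} };
        // Clustering {P1, P2, P3} and {P4}
        System.out.println(criterion(List.of(List.of(0, 1, 2), List.of(3)), link, 0.3));
    }
}
```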

26
Criterion function
  • By maximizing this criterion function, we
    maximize the sum of links between intra-cluster
    point pairs and, at the same time, minimize the
    sum of links among pairs of points belonging to
    different clusters (i.e. among inter-cluster
    point pairs).

27
Major definitions
  • Similarity for data objects
  • Neighbors
  • Links
  • Criterion function
  • Goodness measure

28
Goodness measure
  • Goodness function:
  • g(Ci,Cj) = link[Ci,Cj] / ((n_i+n_j)^(1+2f(θ)) - n_i^(1+2f(θ)) - n_j^(1+2f(θ)))
  • where link[Ci,Cj] is the number of cross links
    between Ci and Cj, i.e. the sum of link(pq,pr)
    over pairs with pq in Ci and pr in Cj.
  • During clustering, we use this goodness measure
    in order to maximize the criterion function.
  • This goodness measure helps to identify the best
    pair of clusters to be merged during each step of
    ROCK (a small sketch follows).
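A minimal sketch of this goodness measure; the numeric check uses two singleton clusters, as in the worked example later in the deck.

```java
// g(Ci, Cj) = crossLinks / ((ni+nj)^e - ni^e - nj^e), with e = 1 + 2f(θ)
// and f(θ) = (1-θ)/(1+θ), the choice suggested for basket-like data.
public class Goodness {
    static double f(double theta) {
        return (1 - theta) / (1 + theta);
    }

    static double goodness(double crossLinks, int ni, int nj, double theta) {
        double e = 1 + 2 * f(theta);
        return crossLinks / (Math.pow(ni + nj, e) - Math.pow(ni, e) - Math.pow(nj, e));
    }

    public static void main(String[] args) {
        // Two singleton clusters with 3 cross links at θ = 0.3:
        // denominator = 2^2.08 - 2 ≈ 2.22, so g ≈ 1.35
        System.out.println(goodness(3, 1, 1, 0.3));
    }
}
```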

29
ROCK Clustering Algorithm
  • Input: a set S of data points,
  • the number k of clusters to be found,
  • and the similarity threshold θ
  • Output: groups of clustered data
  • The ROCK algorithm is divided into three major
    parts:
  • 1. Draw a random sample from the data set
  • 2. Perform a hierarchical agglomerative clustering
    algorithm
  • 3. Label the data on disk
  • In our case, we do not deal with a very large
    data set, so we will consider the whole data in
    the process of forming clusters, i.e. we skip
    steps 1 and 3.

30
ROCK Clustering Algorithm
  • 1. Draw a random sample from the data set
  • Sampling is used to ensure scalability to very
    large data sets.
  • The initial sample is used to form clusters; then
    the remaining data on disk is assigned to these
    clusters.
  • In our case, we will consider the whole data in
    the process of forming clusters.

31
ROCK Clustering Algorithm
  • 2. Perform a hierarchical agglomerative clustering
    algorithm
  • ROCK performs the following steps, which are
    common to all hierarchical agglomerative
    clustering algorithms, but with a different
    definition of the similarity measure (a compact
    sketch follows this list):
  • a. Place each single data point into a separate
    cluster.
  • b. Compute the similarity (goodness) measure for
    all pairs of clusters.
  • c. Merge the two clusters with the highest
    goodness measure.
  • d. Verify the stop condition; if it is not met,
    go to step b.
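A compact sketch of this loop, assuming a precomputed point-to-point link matrix and the goodness measure defined earlier; the stop condition here is simply reaching k clusters.

```java
import java.util.ArrayList;
import java.util.List;

// Agglomerative loop: repeatedly merge the pair of clusters with the
// highest goodness measure until only k clusters remain.
public class RockLoop {
    static double f(double theta) { return (1 - theta) / (1 + theta); }

    static double goodness(double links, int ni, int nj, double theta) {
        double e = 1 + 2 * f(theta);
        return links / (Math.pow(ni + nj, e) - Math.pow(ni, e) - Math.pow(nj, e));
    }

    static List<List<Integer>> cluster(int[][] link, int k, double theta) {
        List<List<Integer>> clusters = new ArrayList<>();
        for (int i = 0; i < link.length; i++) {
            List<Integer> c = new ArrayList<>();
            c.add(i);
            clusters.add(c);                 // step a: one cluster per point
        }
        while (clusters.size() > k) {        // step d: stop condition
            int bi = -1, bj = -1;
            double best = 0;
            for (int i = 0; i < clusters.size(); i++) {
                for (int j = i + 1; j < clusters.size(); j++) {
                    double links = 0;        // cross-cluster link count
                    for (int p : clusters.get(i))
                        for (int q : clusters.get(j))
                            links += link[p][q];
                    double g = goodness(links, clusters.get(i).size(),
                                        clusters.get(j).size(), theta); // step b
                    if (g > best) { best = g; bi = i; bj = j; }
                }
            }
            if (bi < 0) break;               // no pair with positive goodness
            clusters.get(bi).addAll(clusters.remove(bj)); // step c: merge
        }
        return clusters;
    }

    public static void main(String[] args) {
        // Link counts from the four-verse example (θ = 0.3); diagonal unused
        int[][] link = {
            {0, 3, 3, 1},
            {3, 0, 3, 2},
            {3, 3, 0, 1},
            {1, 2, 1, 0}
        };
        System.out.println(cluster(link, 2, 0.3)); // prints [[0, 1, 2], [3]]
    }
}
```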

32
  • 3. Label data on disk
  • Finally, the remaining data points on disk are
    assigned to the generated clusters.
  • This is done by selecting a random sample Li from
    each cluster Ci; then we assign each point p to
    the cluster for which it has the strongest
    linkage with Li.
  • As we said, we will consider the whole data in
    the process of forming clusters, so this step is
    skipped.

33
ROCK Clustering Algorithm
  • Computation of links
  • Using the similarity threshold θ, we can convert
    the similarity matrix into an adjacency matrix
    (A).
  • Then we obtain a matrix indicating the number of
    links by calculating (A x A), i.e. by multiplying
    the adjacency matrix A with itself (a small
    sketch follows).
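A minimal sketch of this computation; the similarity values are the Jaccard similarities of the four-verse example on the next slide, and the names are illustrative.

```java
// Threshold the similarity matrix at θ to get the 0/1 adjacency matrix A
// (each point is its own neighbor), then compute the link matrix as A x A.
public class LinkMatrix {
    static int[][] links(double[][] sim, double theta) {
        int n = sim.length;
        int[][] a = new int[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                a[i][j] = sim[i][j] >= theta ? 1 : 0;  // adjacency A
        int[][] link = new int[n][n];
        for (int i = 0; i < n; i++)                    // link = A x A
            for (int j = 0; j < n; j++)
                for (int m = 0; m < n; m++)
                    link[i][j] += a[i][m] * a[m][j];
        return link;
    }

    public static void main(String[] args) {
        // Jaccard similarities of the four-verse example (diagonal = 1)
        double[][] sim = {
            {1.0,     0.4, 0.4, 1.0 / 6},
            {0.4,     1.0, 0.5, 0.5},
            {0.4,     0.5, 1.0, 0.2},
            {1.0 / 6, 0.5, 0.2, 1.0}
        };
        System.out.println(links(sim, 0.3)[0][1]); // link(P1, P2) = 3
    }
}
```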

34
ROCK Example
  • Suppose we have four verses containing some
    subjects, as follows:
  • P1 = {judgment, faith, prayer, fair}
  • P2 = {fasting, faith, prayer}
  • P3 = {fair, fasting, faith}
  • P4 = {fasting, prayer, pilgrimage}
  • the similarity threshold θ = 0.3, and the number
    of required clusters is 2.
  • Using the Jaccard coefficient as a similarity
    measure, we obtain the following similarity
    table (values rounded):

         P1     P2     P3     P4
    P1   1.00   0.40   0.40   0.17
    P2   0.40   1.00   0.50   0.50
    P3   0.40   0.50   1.00   0.20
    P4   0.17   0.50   0.20   1.00

35
ROCK Example
  • Since we have a similarity threshold equal to
    0.3, we derive the adjacency table (entry 1 if
    sim ≥ 0.3; each point is its own neighbor):

         P1  P2  P3  P4
    P1   1   1   1   0
    P2   1   1   1   1
    P3   1   1   1   0
    P4   0   1   0   1

  • By multiplying the adjacency table with itself,
    we derive the following table, which shows the
    number of links (or common neighbors):

         P1  P2  P3  P4
    P1   3   3   3   1
    P2   3   4   3   2
    P3   3   3   3   1
    P4   1   2   1   2

36
ROCK Example
  • We compute the goodness measure for all adjacent
    points, assuming that f(θ) = (1-θ)/(1+θ).
  • With θ = 0.3, 1+2f(θ) ≈ 2.08, so for two
    singleton clusters the denominator is
    2^2.08 - 2 ≈ 2.22 and g = link / 2.22:

    g(P1,P2) ≈ 1.35    g(P1,P3) ≈ 1.35    g(P2,P3) ≈ 1.35
    g(P2,P4) ≈ 0.90    g(P1,P4) ≈ 0.45    g(P3,P4) ≈ 0.45

  • We have an equal (highest) goodness measure for
    merging (P1,P2), (P1,P3), and (P2,P3).

37
ROCK Example
  • Now, we start the hierarchical algorithm by
    merging, say, P1 and P2.
  • A new cluster (let's call it C(P1,P2)) is formed.
  • It should be noted that other hierarchical
    clustering techniques would not start the
    clustering process by merging P1 and P2, since
    sim(P1,P2) = 0.4, which is not the highest
    similarity (sim(P2,P3) = sim(P2,P4) = 0.5). But
    ROCK uses the number of links as the similarity
    measure rather than distance.

38
ROCK Example
  • Now, after merging P1 and P2, we have only three
    clusters, with the following numbers of common
    neighbors:
  • link(C(P1,P2), P3) = 3 + 3 = 6
  • link(C(P1,P2), P4) = 1 + 2 = 3
  • link(P3, P4) = 1
  • Then we obtain the following goodness measures
    for all adjacent clusters (the denominator for
    merging a 2-point with a 1-point cluster is
    3^2.08 - 2^2.08 - 1 ≈ 4.57):
  • g(C(P1,P2), P3) ≈ 6 / 4.57 ≈ 1.31
  • g(C(P1,P2), P4) ≈ 3 / 4.57 ≈ 0.66
  • g(P3, P4) ≈ 0.45

39
ROCK Example
  • Since the number of required clusters is 2, we
    finish the clustering algorithm by merging
    C(P1,P2) and P3, obtaining a new cluster
    C(P1,P2,P3), which contains P1, P2, and P3,
    leaving P4 alone in a separate cluster.

40
Conclusion and future work (1/3)
  • We aim to apply a clustering technique to the
    verses of the Holy Quran.
  • We should first perform manual preprocessing of
    the Quran text to capture the subjects of the
    verses in a tabular format.
  • Then we can apply a clustering algorithm that
    clusters each set of similar verses into the
    same group.

41
Conclusion and future work (2/3)
  • Most traditional clustering algorithms use
    distance-based similarity measures, which are
    not appropriate for clustering our
    categorical-type dataset.
  • We will apply the general framework of the ROCK
    algorithm.
  • The ROCK (RObust Clustering using linKs)
    algorithm is an agglomerative hierarchical
    clustering algorithm for clustering categorical
    data. It introduces a new notion of links to
    measure the similarity between data objects.

42
Conclusion and future work (3/3)
  • We will adopt the Java language to implement the
    ROCK clustering algorithm.
  • During testing, we will try to form clusters of
    verses belonging to a single sura, and of verses
    belonging to many different suras.
  • Insha Allah, we will achieve success in
    performing this mission.

43
  • Thank You for your attention
  • I will be glad to answer your questions