Clustering Web Pages: a critical literature review

1
Clustering Web Pages: a critical literature review
Weizheng Gao 2003/06/20
2
Outline
  • Introduction
  • STC (Suffix Tree Clustering) Algorithm
  • Grouper: A Clustering Engine
  • References

3
1. Introduction
  • Problems of conventional document retrieval
    systems
      • Low precision
      • Ranked-list presentation
  • How about off-line clustering?

4
An alternative Model
5
Requirements for Web document clustering methods
  • Relevance
  • Browsable summaries
  • Overlap
  • Snippet-tolerance
  • Speed
  • Incrementality

6
2. STC (Suffix Tree Clustering)
  • A novel, incremental, O(n) time algorithm
  • Treats a document as a string
  • Relies on Suffix Tree to identify common phrases
  • Uses the common information to create clusters
  • Also uses this information to summarize the
    contents of clusters

7
What is Suffix Tree?
  • A suffix tree is a rooted, directed tree
  • Each internal node has at least 2 children
  • Each edge is labeled with a non-empty sub-string
    of S.
  • The label of a node is the concatenation of the
    edge-labels on the path from the root to that
    node.
  • No two edges out of the same node can have
    edge-labels that begin with the same word.
  • For each suffix s of S, there exists a
    suffix-node whose label equals s
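
The properties above can be checked on a small word-level example. The following is a minimal sketch, not the construction used by STC: it builds an uncompressed suffix trie in quadratic time and then merges single-child chains, whereas a linear-time construction would be needed to reach the O(n) bound. The function names and the "$" end marker are assumptions made for illustration.

```python
def build_suffix_trie(words):
    """Insert every suffix of the word list into a nested-dict trie.

    A "$" end marker (an assumption of this sketch) is appended so that
    every suffix ends at its own leaf.
    """
    root = {}
    for start in range(len(words)):
        node = root
        for word in words[start:] + ["$"]:
            node = node.setdefault(word, {})
    return root


def compact(node):
    """Merge chains of single-child nodes into one edge with a multi-word label."""
    out = {}
    for word, child in node.items():
        label = [word]
        while len(child) == 1:
            (next_word, grandchild), = child.items()
            label.append(next_word)
            child = grandchild
        out[" ".join(label)] = compact(child)
    return out


def count_leaves(node):
    """Count leaves; after compaction there is one leaf per suffix."""
    return 1 if not node else sum(count_leaves(c) for c in node.values())


tree = compact(build_suffix_trie("I know you know I know".split()))
for label, child in tree.items():
    print(repr(label), "->", sorted(child))
```

After compaction every internal node has at least two children, and no two edges leaving a node start with the same word, which matches the definition on the slide.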

8
An Example
I know you know I know
Trimming
9
Logical Steps
  • Step-1 Document Cleaning
  • Step-2 Identifying Base Clusters
  • Step-3 Combining Base Clusters
  • Step-4 Score Clusters

10
Step-1 Document Cleaning
  • Use a light stemming algorithm
  • Mark sentence boundaries
  • Strip non-word tokens

The original document strings are kept, as well
as pointers from the beginning of each word in
the transformed string to its position in the
original strings.
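
The cleaning step can be sketched as follows. This is only an illustration under stated assumptions: `light_stem` is a hypothetical stand-in that lowercases and strips a simple plural "s", not the stemming algorithm used by STC, and the regular expression treats anything that is not a run of ASCII letters or a sentence-ending mark as a non-word token.

```python
import re


def light_stem(word):
    """Hypothetical light stemmer: lowercase and strip a simple plural 's'."""
    w = word.lower()
    if len(w) > 3 and w.endswith("s") and not w.endswith("ss"):
        return w[:-1]
    return w


def clean(text):
    """Return a list of sentences; each sentence is a list of (stem, offset).

    The offset points at the first character of the word in the original
    text, so the original wording can be recovered for display.
    """
    sentences, current = [], []
    for match in re.finditer(r"[A-Za-z]+|[.!?]", text):
        token = match.group()
        if token in ".!?":
            # Sentence boundary: close the current sentence if it has words.
            if current:
                sentences.append(current)
                current = []
        else:
            current.append((light_stem(token), match.start()))
    if current:
        sentences.append(current)
    return sentences


text = "Cats ate cheese. Mice, too!"
for sentence in clean(text):
    print(sentence)
```

Keeping the offsets alongside the stems is what allows cluster summaries to show the original phrase rather than its stemmed form.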
11
Step-2 Identifying Base Clusters
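  • Strings (subscripts give each word's position in
    its string)
      1. cat_1 ate_2 cheese_3
      2. mouse_1 ate_2 cheese_3 too_4
      3. cat_1 ate_2 mouse_3 too_4

Each suffix-node in the tree is tagged with two
numbers. The first number designates the string of
origin. The second number designates which suffix
of that string labels that suffix-node.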
12
The suffix tree of the strings "cat ate cheese",
"mouse ate cheese too" and "cat ate mouse too"
13
Each node represents a base cluster
Table 1 Six nodes and their corresponding base
clusters
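
The base clusters for the three example strings can be reproduced with a brute-force sketch. It does not build a suffix tree: it enumerates every word sequence and keeps those that would be internal nodes, that is, phrases whose occurrences are followed by at least two different continuations, with the end of each document counted as a distinct continuation. The function name and this quadratic shortcut are assumptions made for illustration.

```python
from collections import defaultdict


def base_clusters(docs):
    """Map each shared phrase to the sorted list of documents containing it.

    A phrase is kept only if it occurs in at least two documents and is
    followed by at least two different continuations, which mirrors the
    branching nodes of a suffix tree over the documents.
    """
    followers = defaultdict(set)  # phrase -> set of tokens seen right after it
    members = defaultdict(set)    # phrase -> set of document numbers
    for doc_id, text in enumerate(docs, start=1):
        words = text.split()
        for i in range(len(words)):
            for j in range(i + 1, len(words) + 1):
                phrase = tuple(words[i:j])
                # The end of a document acts as a per-document terminator.
                nxt = words[j] if j < len(words) else ("$", doc_id)
                followers[phrase].add(nxt)
                members[phrase].add(doc_id)
    return {
        " ".join(p): sorted(members[p])
        for p in members
        if len(members[p]) >= 2 and len(followers[p]) >= 2
    }


docs = ["cat ate cheese", "mouse ate cheese too", "cat ate mouse too"]
for phrase, doc_ids in sorted(base_clusters(docs).items()):
    print(f"{phrase!r}: documents {doc_ids}")
```

For these three strings the sketch yields six phrases, the same count as the six nodes named in the Table 1 caption. The word "cat" alone is excluded because it is always followed by "ate", so it lies in the middle of an edge rather than at a node.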
14
Each base cluster is assigned a score
The score s(B) of base cluster B with phrase P is
given by
$s(B) = |B| \cdot f(|P|) \cdot \sum_{w_i \in P} \mathrm{tfidf}(w_i)$
|B| is the number of documents in base cluster
B. |P| is the number of words in P. The function
f penalizes single-word phrases, is linear for
phrases that are two to six words long, and
becomes constant for longer phrases.
$\sum_{w_i \in P} \mathrm{tfidf}(w_i)$ is the sum of the
standard term frequency-inverse document frequency
(tf-idf) ranking factor over all terms in phrase P.
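
A minimal sketch of this score follows. The slide only describes the shape of f, so the concrete values used here (0.5 for a single word, the word count for two to six words, 6 beyond that) are assumptions, as are the function names and the tf-idf weights passed in as a plain dictionary.

```python
def phrase_length_factor(n_words):
    """Assumed form of f: penalize one word, linear for 2..6, constant after."""
    if n_words <= 0:
        return 0.0
    if n_words == 1:
        return 0.5  # assumed penalty value, not taken from the slides
    return float(min(n_words, 6))


def base_cluster_score(doc_ids, phrase_words, tfidf):
    """s(B) = |B| * f(|P|) * sum of tf-idf weights of the words in P."""
    weight_sum = sum(tfidf.get(word, 0.0) for word in phrase_words)
    return len(doc_ids) * phrase_length_factor(len(phrase_words)) * weight_sum


# Illustrative tf-idf weights (made up for this example).
tfidf = {"cat": 1.0, "ate": 0.5}
print(base_cluster_score({1, 3}, ["cat", "ate"], tfidf))
print(base_cluster_score({1, 2, 3}, ["ate"], {"ate": 0.0}))
```

A word with zero weight, such as a stoplisted word, drives the score of a single-word phrase to 0.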
15
Step-3 Combining Base Clusters
Binary similarity measure
The similarity of $B_m$ and $B_n$ is defined to be 1 iff
$|B_m \cap B_n| / |B_m| > \alpha$ and $|B_m \cap B_n| / |B_n| > \alpha$.
Otherwise, their similarity is 0.
The base cluster graph for $\alpha = 0.5$
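
The binary similarity measure above can be written directly over sets of document numbers. This is a small sketch; the function name and the choice to return 0 for an empty cluster are assumptions.

```python
def similar(b_m, b_n, alpha=0.5):
    """Return 1 if the overlap exceeds alpha of both clusters, else 0."""
    if not b_m or not b_n:
        return 0  # assumed convention for empty clusters
    overlap = len(b_m & b_n)
    return int(overlap / len(b_m) > alpha and overlap / len(b_n) > alpha)


# Document sets for phrases from the three example strings.
cat_ate = {1, 3}
ate = {1, 2, 3}
cheese = {1, 2}
mouse = {2, 3}

print(similar(cat_ate, ate))   # overlap 2: 2/2 and 2/3 both exceed 0.5
print(similar(cheese, mouse))  # overlap 1: 1/2 does not exceed 0.5
```

Because the comparison is strict, two clusters that share exactly half of their documents are not connected at a threshold of 0.5.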
16
The phrase cluster graph
(a) For a similarity threshold of 0.7 there are
four connected components in the graph,
representing four merged clusters. (b) For a
threshold of 0.6 there is a single connected
component in the graph, representing one merged
cluster. (c) If the word "ate" had been in the
stoplist, the phrase cluster b would have been
discarded as it would have had a score of 0, and
for a threshold of 0.6 we would have had three
connected components in the graph, representing
three merged clusters.
17
Merged clusters as connected components in the
phrase cluster graph
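
Merging by connected components can be sketched with a small union-find. The graph edges are passed in explicitly here, so the sketch is independent of how similarity was computed. The function name, the single-letter labels and the sample data are illustrative assumptions.

```python
def merge_clusters(base, edges):
    """Group base clusters into connected components of the given graph.

    base:  dict mapping a base-cluster label to its set of document numbers
    edges: iterable of (label, label) pairs whose similarity is 1
    Returns a sorted list of (labels, documents) pairs, one per component.
    """
    parent = {label: label for label in base}

    def find(x):
        # Follow parent links to the component root, halving the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in edges:
        parent[find(u)] = find(v)

    groups = {}
    for label in base:
        groups.setdefault(find(label), []).append(label)

    return sorted(
        (sorted(labels), sorted(set().union(*(base[l] for l in labels))))
        for labels in groups.values()
    )


base = {"a": {1, 3}, "b": {1, 2, 3}, "c": {1, 2}, "x": {7, 8}}
edges = [("a", "b"), ("b", "c")]
for labels, docs in merge_clusters(base, edges):
    print(labels, docs)
```

Each merged cluster holds the union of the documents of its base clusters, so a document can appear in more than one merged cluster.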
18
Step-4 Score Clusters
$N_c$ is the number of documents in cluster C. Only
consider labels $l_0$ to $l_n$ that are in C and are
not subsets of any other label.
$p(l) = \sum_{w \in l} p(w)$
$p(w) = \log(1/f_w)$ if $f_w > 0$, and $p(w) = \log(1/0.5)$
if $f_w = 0$
19
The main advantages of STC
  • It is phrase-based
  • It does not adhere to any model of the data
  • STC uses a simple cluster definition
  • STC allows overlapping clusters
  • STC is a fast, incremental, linear-time algorithm

20
3. Grouper: A Clustering Engine
  • Grouper is a clustering interface to the
    HuskySearch meta-search service.
  • Grouper clusters the results as they arrive using
    the STC algorithm.
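
Clustering results as they arrive means the phrase index is updated one snippet at a time rather than rebuilt. The sketch below illustrates that idea with a plain phrase-to-documents map; it is not Grouper's implementation, and the class name, method names and the six-word phrase cap are hypothetical.

```python
from collections import defaultdict


class IncrementalPhraseIndex:
    """Hypothetical incremental index from word phrases to document numbers."""

    def __init__(self, max_len=6):
        self.max_len = max_len                  # assumed cap on phrase length
        self.docs_by_phrase = defaultdict(set)
        self.n_docs = 0

    def add(self, snippet):
        """Index one newly arrived snippet and return its document number."""
        self.n_docs += 1
        words = snippet.lower().split()
        for i in range(len(words)):
            last = min(i + self.max_len, len(words))
            for j in range(i + 1, last + 1):
                self.docs_by_phrase[tuple(words[i:j])].add(self.n_docs)
        return self.n_docs

    def shared_phrases(self):
        """Phrases seen in at least two snippets so far."""
        return {
            " ".join(phrase): sorted(docs)
            for phrase, docs in self.docs_by_phrase.items()
            if len(docs) >= 2
        }


index = IncrementalPhraseIndex()
for snippet in ["cat ate cheese", "mouse ate cheese too"]:
    index.add(snippet)
    print(index.shared_phrases())
```

Unlike a suffix tree, this map keeps every shared phrase, including ones that always continue the same way, so it would need the branching filter before its entries could be treated as base clusters.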

21
User interface
Grouper's query interface. Users need not enter
any parameters for the clustering algorithm.
22
The main result page
The main results page in Grouper for the query
"israel"
23
References
  • Oren Zamir, Oren Etzioni, Omid Madani, Richard
    M. Karp, Fast and Intuitive Clustering of Web
    Documents, KDD, 1997
  • Oren Zamir, Oren Etzioni, Web Document
    Clustering: A Feasibility Demonstration, In Proc.
    ACM SIGIR'98, 1998
  • Oren Zamir, Oren Etzioni, Grouper: A Dynamic
    Clustering Interface to Web Search Results, WWW8,
    1999
  • Steve Branson, Ari Greenberg, Clustering Web
    Search Results Using Suffix Tree Methods,
    Stanford University

24
Thanks!