Clustering Web Pages: a critical literature review

1
Clustering Web Pages: a critical literature review
Weizheng Gao, 2003/06/20
2
Outline

Introduction
STC (Suffix Tree Clustering) Algorithm
Grouper: A Clustering Engine
References

3
1. Introduction

Problems of conventional document retrieval systems:
- Low precision
- Ranked-list presentation
How about off-line clustering?

4
An alternative Model
5
Requirements for Web document clustering methods

- Relevance
- Browsable summaries
- Overlap
- Snippet-tolerance
- Speed
- Incrementality

6
2. STC (Suffix Tree Clustering)

A novel, incremental, O(n) time algorithm
Treats a document as a string of words
Relies on a suffix tree to identify phrases common to several documents
Uses these shared phrases to create clusters
Also uses these phrases to summarize the contents of clusters

7
What is a Suffix Tree?

A suffix tree of a string S is a rooted, directed tree.
Each internal node has at least 2 children.
Each edge is labeled with a non-empty sub-string of S.
The label of a node is the concatenation of the edge-labels on the path from the root to that node.
No two edges out of the same node can have edge-labels that begin with the same word.
For each suffix s of S, there exists a suffix-node whose label equals s.
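
The properties above can be illustrated with a minimal sketch. It builds a naive word-level suffix trie for the example sentence "I know you know I know" and then merges single-child chains, so that every internal node ends up with at least two children. The function names and the "$" end marker are assumptions made for this illustration, and the naive construction is quadratic rather than the linear-time construction STC relies on.

```python
def build_suffix_trie(words):
    """Insert every word-level suffix of `words` into a nested-dict trie."""
    root = {}
    for start in range(len(words)):
        node = root
        for word in words[start:]:
            node = node.setdefault(word, {})
        node["$"] = start  # end marker: index of the suffix that ends here
    return root


def compact(node):
    """Merge single-child chains so edge labels become multi-word phrases."""
    out = {}
    for label, child in node.items():
        if label == "$":
            out["$"] = child
            continue
        # Follow the chain while the child has exactly one ordinary edge.
        while len(child) == 1 and "$" not in child:
            (next_label, next_child), = child.items()
            label = label + " " + next_label
            child = next_child
        out[label] = compact(child)
    return out


tree = compact(build_suffix_trie("I know you know I know".split()))
for edge_label in tree:
    print(edge_label)
```

The edges leaving the root are labeled "I know", "know" and "you know I know"; no two of them begin with the same word, as the definition requires.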

8
An Example
I know you know I know
Trimming
9
Logical Steps

Step-1 Document Cleaning
Step-2 Identifying Base Clusters
Step-3 Combining Base Clusters
Step-4 Score Clusters

10
Step-1 Document Cleaning
- Use a light stemming algorithm
- Mark sentence boundaries
- Strip non-word tokens
The original document strings are kept, as well as pointers from the beginning of each word in the transformed string to its position in the original strings.
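
A minimal sketch of this cleaning step follows. The stemmer here is a hypothetical stand-in that only lowercases and strips a plural "s"; the slides do not say which light stemming algorithm is used. Each cleaned word keeps the character offset of its source word, which plays the role of the pointer back into the original string.

```python
import re


def light_stem(word):
    """Hypothetical 'light' stemmer: lowercase and strip a plural 's'."""
    w = word.lower()
    if len(w) > 3 and w.endswith("s") and not w.endswith("ss"):
        w = w[:-1]
    return w


def clean(text):
    """Return a list of sentences, each a list of
    (stemmed_word, offset_in_original_text) pairs."""
    sentences, current = [], []
    # A token is a run of letters/digits or a sentence terminator;
    # everything else (non-word tokens) is skipped.
    for m in re.finditer(r"[A-Za-z0-9]+|[.!?]", text):
        tok = m.group()
        if tok in ".!?":
            if current:  # sentence boundary
                sentences.append(current)
                current = []
        else:
            current.append((light_stem(tok), m.start()))
    if current:
        sentences.append(current)
    return sentences


doc = "Cats ate the cheese. Mice -- too!"
cleaned = clean(doc)
print(cleaned)
```

Because the offsets are kept, a phrase found in the cleaned text can later be displayed using the original, unstemmed wording.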
11
Step-2 Identifying Base Clusters

Strings
1. cat_1 ate_2 cheese_3
2. mouse_1 ate_2 cheese_3 too_4
3. cat_1 ate_2 mouse_3 too_4

Each suffix-node in the tree is tagged with two numbers. The first number designates the string of origin. The second number designates which suffix of that string labels that suffix-node.
12
The suffix tree of the strings "cat ate cheese", "mouse ate cheese too" and "cat ate mouse too"
13
Each node represents a base cluster
Table 1: Six nodes and their corresponding base clusters
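
The base clusters for the three example strings can be recomputed with a short sketch. As an assumption for brevity, it enumerates phrases directly instead of building the suffix tree, so it runs in quadratic rather than linear time. A phrase counts as an internal tree node when it is followed by at least two distinct continuations, where the end of each string is treated as its own distinct terminator, and it counts as a base cluster when it occurs in at least two documents.

```python
from collections import defaultdict


def base_clusters(docs):
    """Map each phrase that would be an internal suffix-tree node shared by
    two or more documents to the set of documents containing it."""
    containing = defaultdict(set)  # phrase -> ids of documents containing it
    followers = defaultdict(set)   # phrase -> distinct continuations seen
    for doc_id, text in enumerate(docs, start=1):
        words = text.split()
        for i in range(len(words)):
            for j in range(i + 1, len(words) + 1):
                phrase = " ".join(words[i:j])
                containing[phrase].add(doc_id)
                # Continuation: the next word, or a per-document end marker.
                nxt = words[j] if j < len(words) else ("$", doc_id)
                followers[phrase].add(nxt)
    return {p: ids for p, ids in containing.items()
            if len(ids) >= 2 and len(followers[p]) >= 2}


docs = ["cat ate cheese", "mouse ate cheese too", "cat ate mouse too"]
clusters = base_clusters(docs)
for phrase, ids in clusters.items():
    print(phrase, sorted(ids))
```

For these strings the sketch yields six base clusters: "cat ate", "ate", "cheese", "mouse", "too" and "ate cheese". The phrase "cat" is not among them because it is always followed by "ate", so it lies in the middle of an edge rather than at a node.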
14
Each base cluster is assigned a score
The score s(B) of base cluster B with phrase P is given by
$$s(B) = |B| \cdot f(|P|) \cdot \sum_i \mathrm{tfidf}(w_i)$$
$|B|$ is the number of documents in base cluster B. $|P|$ is the number of words in P. The function f penalizes single-word phrases, is linear for phrases that are two to six words long, and becomes constant for longer phrases. $\sum_i \mathrm{tfidf}(w_i)$ is the sum of the standard term frequency-inverse document frequency (tf-idf) ranking factor over all terms $w_i$ in phrase P.
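
The score can be sketched in a few lines. The slides describe only the shape of f, so the constants below (0.5 for a single word, a cap at 6) are illustrative assumptions, as are the tf-idf weights, which are made up for the example rather than computed from a corpus.

```python
def f(num_words):
    """Phrase-length factor: penalize one-word phrases, grow linearly for
    two to six words, stay constant beyond that. Constants are assumed."""
    if num_words <= 1:
        return 0.5
    return float(min(num_words, 6))


def base_cluster_score(doc_ids, phrase, tfidf):
    """s(B) = |B| * f(|P|) * sum of tfidf(w) over the words w of phrase P."""
    words = phrase.split()
    return len(doc_ids) * f(len(words)) * sum(tfidf.get(w, 0.0) for w in words)


tfidf = {"cat": 1.0, "ate": 0.5, "cheese": 1.0}  # made-up weights
score = base_cluster_score({1, 3}, "cat ate", tfidf)
print(score)  # 2 documents * f(2) = 2.0 * (1.0 + 0.5) = 6.0
```

A phrase whose words all have a weight of zero, for example because they are in the stoplist, gets a score of 0 under this formula.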
15
Step-3 Combining Base Clusters
Binary similarity measure
The similarity of base clusters $B_m$ and $B_n$ is 1 iff $|B_m \cap B_n| / |B_m| > \alpha$ and $|B_m \cap B_n| / |B_n| > \alpha$. Otherwise, their similarity is 0.
The base cluster graph for $\alpha = 0.5$
16
The phrase cluster graph
(a) For $\alpha = 0.7$ there are four connected components in the graph, representing four merged clusters.
(b) For $\alpha = 0.6$ there is a single connected component in the graph, representing one merged cluster.
(c) If the word "ate" had been in the stoplist, the phrase cluster b would have been discarded as it would have had a score of 0, and for $\alpha = 0.6$ we would have had three connected components in the graph, representing three merged clusters.
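
The three cases can be checked with a small sketch of the merge step. The document sets are those obtained from the three example strings "cat ate cheese", "mouse ate cheese too" and "cat ate mouse too"; apart from b, which the text ties to the word "ate", the letters a to f are assigned here as an assumption. Connected components are found with a simple union-find, which is one possible implementation choice and not something the slides specify.

```python
from itertools import combinations


def similar(b_m, b_n, alpha):
    """Binary similarity: both overlap ratios must exceed alpha."""
    overlap = len(b_m & b_n)
    return overlap / len(b_m) > alpha and overlap / len(b_n) > alpha


def merge(base, alpha):
    """Return the connected components of the base cluster graph
    as a list of sets of base cluster names."""
    parent = {name: name for name in base}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for m, n in combinations(base, 2):
        if similar(base[m], base[n], alpha):
            parent[find(m)] = find(n)

    groups = {}
    for name in base:
        groups.setdefault(find(name), set()).add(name)
    return list(groups.values())


base = {
    "a": {1, 3},     # cat ate
    "b": {1, 2, 3},  # ate
    "c": {1, 2},     # cheese
    "d": {2, 3},     # mouse
    "e": {2, 3},     # too
    "f": {1, 2},     # ate cheese
}
without_b = {name: docs for name, docs in base.items() if name != "b"}

print(len(merge(base, 0.7)))       # case (a)
print(len(merge(base, 0.6)))       # case (b)
print(len(merge(without_b, 0.6)))  # case (c)
```

At $\alpha = 0.7$ only the identical pairs (c, f) and (d, e) are linked. At $\alpha = 0.6$ cluster b overlaps every other cluster with ratios of 1 and 2/3, so it pulls everything into one component.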
17
Merged clusters as connected components in the
phrase cluster graph
18
Step-4 Score Clusters
$N_C$ is the number of documents in cluster C. Only consider labels $l_0$ to $l_n$ that are in C and are not subsets of any other label.
$p(l) = \sum_{w \in l} p(w)$
$p(w) = \log(1/f_w)$ if $f_w > 0$, and $p(w) = \log(1/0.5)$ if $f_w = 0$
19
The main advantages of STC

It is phrase-based
It does not adhere to any model of the data
STC uses a simple cluster definition
STC allows overlapping clusters
STC is a fast, incremental, linear-time algorithm

20
3. Grouper: A Clustering Engine

Grouper is a clustering interface to the HuskySearch meta-search service.
Grouper clusters the results as they arrive using the STC algorithm.

21
User interface
Grouper's query interface. Users need not enter any parameters for the clustering algorithm.
22
The main result page
The main results page in Grouper for the query "israel"
23
References

Oren Zamir, Oren Etzioni, Omid Madani, Richard M. Karp, Fast and Intuitive Clustering of Web Documents, KDD, 1997
Oren Zamir, Oren Etzioni, Web Document Clustering: A Feasibility Demonstration, In Proc. ACM SIGIR'98, 1998
Oren Zamir, Oren Etzioni, Grouper: A Dynamic Clustering Interface to Web Search Results, WWW8, 1999
Steve Branson, Ari Greenberg, Clustering Web Search Results Using Suffix Tree Methods, Stanford University

24
Thanks!
